Abstract

We define, analyze, and give efficient algorithms for two kinds of distance measures for 
rooted and unrooted phylogenies. For rooted trees, our measures are based on the topologies 
the input trees induce on triplets; that is, on three-element subsets of the set of species. For 
unrooted trees, the measures are based on quartets (four-element subsets). Triplet and quartet- 
based distances provide a robust and fine-grained measure of the similarities between trees. 
The distinguishing feature of our distance measures relative to traditional quartet and triplet 
distances is their ability to deal cleanly with the presence of unresolved nodes, also called 
polytomies. For rooted trees, these are nodes with more than two children; for unrooted trees, 
they are nodes of degree greater than three. 

Our first class of measures are parametric distances, where there is a parameter that weighs 
the difference between an unresolved triplet/quartet topology and a resolved one. Our second 
class of measures are based on Hausdorff distance. Each tree is viewed as a set of all possible 
ways in which the tree could be refined to eliminate unresolved nodes. The distance between 
the original (unresolved) trees is then taken to be the Hausdorff distance between the associated 
sets of fully resolved trees, where the distance between trees in the sets is the triplet or quartet 
distance, as appropriate. 

Keywords. Aggregation, Hausdorff distance, phylogenetic trees, quartet distance, triplet 
distance. 



1 Introduction 

Evolutionary trees, also known as phylogenetic trees or phylogenies, represent the evolutionary 
history of sets of species. Such trees have uniquely labeled leaves, corresponding to the species, 
and unlabeled internal nodes, representing hypothetical ancestors. The trees can be either rooted, 
if the evolutionary origin is known, or unrooted, otherwise. 
This paper addresses two related questions: (1) How does one measure how close two evolutionary
trees are to each other? (2) How does one combine or aggregate the phylogenetic information
from conflicting trees into a single consensus tree? Among the motivations for the first
question is the growth of phylogenetic databases, such as TreeBase [28], with the attendant need
for sophisticated querying mechanisms and for means to assess the quality of answers to queries.
The second question arises from the fact that phylogenetic analyses (e.g., by parsimony [22])
typically produce multiple evolutionary trees (often in the thousands) for the same set of species.
Another motivation is the ongoing effort to assemble the tree of life by piecing together
phylogenies for subsets of species [11].

We address the above questions by defining appropriate distance measures between trees. 
While several such measures have been proposed before (see below), ours provide a feature that 
previous ones do not: The ability to deal cleanly with the presence of unresolved nodes, also called 
polytomies. For rooted trees these are nodes with more than two children; for unrooted trees, they 
are nodes of degree greater than three. Polytomies cannot simply be ignored, since they arise 
naturally in phylogenetic analysis. Furthermore, they must be treated with care: A node may be 
unresolved because it truly must be so or because there is not enough evidence to break it up into 
resolved nodes; that is, the polytomies are either "hard" or "soft" [26].

Our contributions. We define and analyze two kinds of distance measures for phylogenies. For 
rooted trees, our measures are based on the topologies the input trees induce on triplets; that is, 
on three-element subsets of the set of species. For unrooted trees, the measures are based on 
quartets (four-element subsets). Our approach is motivated by the observation that triplet and 
quartet topologies are the basic building blocks of rooted and unrooted trees, in the sense that 
they are the smallest topological units that completely identify a phylogenetic tree [30]. Triplet-
and quartet-based distances thus provide a robust and fine-grained measure of the differences and
similarities between trees.¹ In contrast with traditional quartet and triplet distances, our two classes
of distance measures deal cleanly with the presence of unresolved nodes. Each of them does so in 
a different way. 

The first kind of measures we propose are parametric distances: Given a triplet (quartet) X, 
we compare the topologies that each of the two input trees induces on X. If they are identical, the 
contribution of X to the distance is zero. If both topologies are fully resolved but different, then 
the contribution is one. Otherwise, the topology is resolved in one of the trees, but not the other. In 
this case, X contributes p to the distance, where p is a real number between 0 and 1. Parameter p
allows one to make a smooth transition between hard and soft views of polytomy. At one extreme, 
if p = 1, an unresolved topology is viewed as different from a fully resolved one. At the other, 
when p = 0, unresolved topologies are viewed as identical to resolved ones. Intermediate values 
of p allow one to adjust for the degree of certainty one has about a polytomy. 

The second kind of measures proposed here are based on viewing each tree as a set of all 
possible fully resolved trees that can be obtained from it by refining its unresolved nodes. The 
distance between two trees is defined as the Hausdorff distance between the corresponding sets,²

¹Biologically-inspired arguments in favor of triplet-based measures can be found in [13].

²Informally, two sets A and B are at Hausdorff distance r of each other if each element of A is within distance r
where the distance between trees in the sets is the triplet or quartet distance, as appropriate.

After defining our distance measures, we proceed to study their mathematical and algorithmic 
properties. We obtain exact and asymptotic bounds on expected values of parametric triplet dis- 
tance and parametric quartet distance. We also study for which values of p, parametric triplet and 
quartet distances are metrics, near-metrics (in the sense of [19]), or non-metrics.

Aside from the mathematical elegance that metrics and near-metrics bring to tree comparison, 
there are also algorithmic benefits. We formulate phylogeny aggregation as a median problem, in 
which the objective is to find a consensus tree whose total distance to the given trees is minimized. 
We do not know whether finding the median tree relative to parametric (triplet or quartet) distance 
is NP-hard, but conjecture that it is. This is suggested by the NP-completeness of the maximum 
triplet compatibility problem³ [9]. However, by the results mentioned above and well-known facts
about the median problem [36], there are simple constant-factor approximation algorithms for the
aggregation of rooted and unrooted trees relative to parametric distance: Simply return the input 
tree with minimum distance to the remaining input trees. We show that there are values of p for 
which parametric distance is a metric, but the median tree may not be fully resolved even if all the 
input trees are. However, beyond a threshold, the median tree is guaranteed to be fully resolved if 
the input trees are fully resolved. 

A natural problem is whether Hausdorff triplet (quartet) distance between two trees can be 
computed in polynomial time. We suspect that computing Hausdorff triplet (quartet) distance is 
NP-hard. However, even if this were so, we show that one can partially circumvent the issue by 
proving that, under a certain density assumption, Hausdorff distance is within a constant factor of 
parametric distance; that is, the measures are equivalent in the sense of [19].

Finally, we present an $O(n^2)$-time algorithm to compute parametric triplet distance and an $O(n^2)$
2-approximate algorithm for parametric quartet distance. To our knowledge, there was no previous
algorithm for computing the parametric triplet distance between two rooted trees, other than by
enumerating all $O(n^3)$ triplets. Two algorithms exist that can be directly applied to compute the
parametric quartet distance (see also [11]). One runs in time $O(n^2 \min\{d_1, d_2\})$, where, for
$i \in \{1, 2\}$, $d_i$ is the maximum degree of a node in $T_i$ [12]; the other takes $O(d^9 n \log n)$ time,
where $d$ is the maximum degree of a node in $T_1$ and $T_2$ [34].⁴ Our faster $O(n^2)$ algorithm offers a
2-approximate solution when an exact value of the parametric quartet distance is not required.
Additionally, our algorithm gives the exact answer when $p = 1/2$.

Related work. Several other measures for comparing trees have been proposed; we mention a 
few. A popular class of distances are those based on symmetric distance between sets of clusters 
(that is, on sets of species that descend from the same internal node in a rooted tree) or of splits 
(partitions of the set of species induced by the removal of an edge in an unrooted tree); the latter 
is the well-known Robinson-Foulds (RF) distance [29]. It is not hard to show that two rooted

of B and vice-versa. For a formal definition, see Section 3.

³The input to this problem consists of a set of trees, each of which has three leaves; the leaf sets of these trees may
not be identical. The question is to find the largest subset of these triplet trees such that all of the trees are consistent 
with a single tree T whose leaf set is the union of the leaves of the input triplet trees. 

⁴Note that the presence of unresolved nodes seems to complicate distance computation. Indeed, the quartet distance
between a pair of fully resolved unrooted trees can be obtained in $O(n \log n)$ time [8].
(unrooted) trees can share many triplet (quartet) topologies but not share a single cluster (split).
Cluster- and split-based measures are also coarser than triplet and quartet distances. 

One can also measure the distance between two trees by counting the number of branch-swapping
operations (e.g., nearest-neighbor interchange or subtree pruning and regrafting operations [22])
needed to convert one of the trees into the other [3]. However, the associated
measures can be hard to compute, and they fail to distinguish between operations that affect many 
species and those that affect only a few. An alternative to distance measures are similarity methods 
such as the maximum agreement subtree (MAST) approach [23]. While there are efficient algorithms
for computing the MAST |ET1, the measure is coarser than triplet-based distances. 

There is an extensive literature on consensus methods for phylogenetic trees. A non-exhaustive 
list of methods based on splits or clusters includes strict consensus trees [27], majority-rule trees
im, and the Adams consensus [1]. In local consensus methods, the goal is to find a consensus tree
that satisfies a given set of constraints on the topology of each triplet [24]. For a thorough survey
of these methods, their properties and interrelationships, see [10].

The fact that consensus methods tend to produce unresolved trees, with an attendant loss of 
information, has been observed before. An alternative approach is to provide multiple consensus 
trees, instead of a single one. The idea, developed more fully in [35], is to cluster the input trees 
using some distance measure into groups, each of which is represented by a single consensus tree, 
in such a way as to minimize some measure of information loss. Our distance measures can be 
used within this framework, where their fine-grained nature could conceivably offer advantages 
over other techniques. 

In addition to consensus methods, there are techniques that take as input sets of quartet trees 
or triplet trees and try to find large compatible subsets or subsets whose removal results in a com- 
patible set L6i ^1 1 . These problems are related to the supertree problem, in which a set of input 
trees that may not all share the same species is given and the problem is to find a single tree that 
exhibits as much as possible of the evolutionary relationships among the input trees |I3. Thus, the 
consensus problem for trees is a special case of the supertree problem. 

The consensus problem on trees exhibits parallels with the rank aggregation problem, a prob- 
lem with a rich history and which has recently found applications to Internet search ^ [51 [141 [IS 
[25l[l8l[19l. Here, we are given a collection of rankings (that is, permutations) of n objects, and 
the goal is to find a ranking of minimum total distance to the input rankings. A distance between 
rankings of particular interest is Kendall's tau, defined as the number of pairwise disagreements 
between the two rankings. Like triplet and quartet distances, Kendall's tau is based on elementary 
ordering relationships. Rank aggregation under Kendall's tau was shown to be NP-complete even 
for four lists by Dwork et al. [18].

A permutation is the analog of a fully resolved tree, since every pairwise relationship between 
elements is given. The analog to a partially-resolved tree is a partial ranking, in which the el- 
ements are grouped into an ordered list of buckets, such that elements in different buckets have 
known ordering relationships, but elements within a bucket are not ranked [19]. Our definitions
of parametric distance and Hausdorff distance are inspired by Fagin et al.'s Kendall tau with
parameter p and their Hausdorff version of Kendall's tau, respectively [19]. We note, however, that
aggregating partial rankings seems computationally easier than the consensus problem on trees. 
For example, while the Hausdorff version of Kendall's tau has a simple and easily computable
expression [14, 19], it is unclear whether the Hausdorff triplet or quartet distances are
polynomially computable for trees.

Organization of the paper. Section 2 reviews basic notions in phylogenetics and distances. Our
distance measures and the consensus problem are formally defined in Section 3. The expected
values of the distance measures are studied in Section 4. The basic properties of parametric
distance are proved in Section 5. Section 6 studies the connection between Hausdorff and parametric
distances. Section 7 gives efficient algorithms for computing parametric triplet distance. A
2-approximation algorithm for parametric quartet distance is given in Section 8.

2 Preliminaries 

Phylogenies. By and large, we follow standard terminology (i.e., similar to ^ and [[301 '). We 
write [N] to denote the set {1, 2, ..., N}, where N is a positive integer.

Let $T$ be a rooted or unrooted tree. We write $V(T)$, $E(T)$, and $\mathcal{L}(T)$ to denote, respectively,
the node set, edge set, and leaf set of $T$. A taxon (plural taxa) is some basic unit of classification;
e.g., a species. Let $S$ be a set of taxa. A phylogenetic tree or phylogeny for $S$ is a tree $T$ such
that $\mathcal{L}(T) = S$. Furthermore, if $T$ is rooted, we require that every internal node have at least two
children; if $T$ is unrooted, every internal node is required to have degree at least three. We write
$RP(n)$ to denote the set of all rooted phylogenetic trees over $S = [n]$ and $P(n)$ to denote the set
of all unrooted phylogenetic trees over $S = [n]$.

An internal node in a rooted phylogeny is resolved if it has exactly two children; otherwise it is 
unresolved. Similarly, an internal node in an unrooted phylogeny is resolved if it has degree three, 
and unresolved otherwise. Unresolved nodes in rooted and unrooted trees are also referred to as 
polytomies or multifurcations. A phylogeny (rooted or unrooted) is fully resolved if all its internal
nodes are resolved. A fan is a completely unresolved phylogeny; i.e., it contains a single internal 
node, to which all leaves are connected (if the phylogeny is rooted, this internal node is the root). 

A contraction of a phylogeny T is obtained by deleting an internal edge and identifying its 
endpoints. A phylogeny $T_2$ is a refinement of phylogeny $T_1$, denoted $T_1 \preceq T_2$, if and only if $T_1$ can
be obtained from $T_2$ through zero or more contractions. Tree $T_2$ is a full refinement of $T_1$ if $T_1 \preceq T_2$
and $T_2$ is fully resolved. We write $\mathcal{F}(T)$ to denote the set of all full refinements of $T$.

Let $X$ be a subset of $\mathcal{L}(T)$ and let $T[X]$ denote the minimal subtree of $T$ having $X$ as its leaf
set. The restriction of $T$ to $X$, denoted $T|X$, is the phylogeny for $X$ defined as follows. If $T$ is
unrooted, then $T|X$ is the tree obtained from $T[X]$ by suppressing all degree-two nodes. If $T$ is
rooted, $T|X$ is obtained from $T[X]$ by suppressing all degree-two nodes except for the root.

A triplet is a three-element subset of $S$. A triplet tree is a rooted phylogeny whose leaf set is
a triplet. The triplet tree with leaf set $\{a, b, c\}$ is denoted by $a|bc$ if the path from $b$ to $c$ does not
intersect the path from $a$ to the root. A quartet is a four-element subset of $S$ and a quartet tree
is an unrooted phylogeny whose leaf set is a quartet. The quartet tree with leaf set $\{a, b, c, d\}$ is
denoted by $ab|cd$ if the path from $a$ to $b$ does not intersect the path from $c$ to $d$. A triplet (quartet)
$X$ is said to be resolved in a phylogenetic tree $T$ over $S$ if $T|X$ is fully resolved; otherwise, $X$ is
unresolved.
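The notion of a triplet being resolved or unresolved can be made concrete with a small sketch. It assumes a representation that is not part of the paper: a rooted phylogeny is a nested Python tuple whose non-tuple entries are leaf labels, and the hypothetical helper `triplet_topology` returns the pair $\{b, c\}$ when the induced triplet tree is $a|bc$, or `None` when the triplet is unresolved.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree (assumed encoding)."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def triplet_topology(tree, triplet):
    """Topology induced by `tree` on a three-leaf set.

    Returns frozenset({b, c}) if the induced triplet tree is a|bc,
    and None if the triplet is unresolved in `tree`.
    """
    wanted = frozenset(triplet)
    node = tree
    while True:
        # Record which of the three leaves lie below each child of `node`.
        groups = [(child, leaf_set(child) & wanted) for child in node]
        groups = [(child, found) for child, found in groups if found]
        if len(groups) == 1:
            # All three leaves lie below a single child: descend into it.
            node = groups[0][0]
        elif len(groups) == 2:
            # A 2 + 1 split: the two leaves that stay together form the cherry.
            return frozenset(max((found for _, found in groups), key=len))
        else:
            # The three leaves separate at the same node: unresolved.
            return None


example_tree = ((("a", "b"), "c"), ("d", "e", "f"))
```

For `example_tree`, the triplet {a, b, c} is resolved as $c|ab$, while {d, e, f} is unresolved because all three leaves hang off the same node.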

Finally, we introduce notation for certain useful subtrees of a tree $T$. Suppose $T$ is rooted and $v$
is a node in $T$. Then, $T(v)$ denotes the subtree of $T$ rooted at $v$. Suppose $T$ is unrooted and $\{u, v\}$
is an edge in $T$. Removal of edge $\{u, v\}$ splits the tree $T$ into two subtrees. We denote the subtree
that contains node $u$ by $T(u, v)$, and the subtree that contains $v$ by $T(v, u)$.

Distance measures, metrics, and near-metrics. A distance measure on a set $D$ is a binary
function $d$ on $D$ satisfying the following three conditions: (i) $d(x, y) \geq 0$ for all $x, y \in D$;
(ii) $d(x, y) = d(y, x)$ for all $x, y \in D$; and (iii) $d(x, y) = 0$ if and only if $x = y$. Function
$d$ is a metric if, in addition to being a distance measure, it satisfies the triangle inequality; i.e.,
$d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in D$. Distance measure $d$ is a near-metric if there is
a constant $c$, independent of the size of $D$, such that $d$ satisfies the relaxed polygonal inequality:
$d(x, z) \leq c\,(d(x, x_1) + d(x_1, x_2) + \cdots + d(x_{n-1}, z))$ for all $n > 1$ and $x, z, x_1, \ldots, x_{n-1} \in D$ [19].
Two distance measures $d$ and $d'$ with domain $D$ are equivalent if there are constants $c_1, c_2 > 0$
such that $c_1 d'(x, y) \leq d(x, y) \leq c_2 d'(x, y)$ for every pair $x, y \in D$ [19].

3 Distance measures for phylogenies 

Here we define the distance measures for rooted and unrooted trees to be studied in the rest of 
the paper. We use essentially the same notation for the rooted tree measures as for the unrooted 
tree measures. We do so because the concepts for each case are close analogs of those for the 
other, the key difference being the use of triplets in one setting (rooted trees) and of quartets in the 
other (unrooted trees). It will be easy to distinguish between the two settings by simply specifying 
the context in which the measures are being applied. Our notation has the benefits of reducing 
repetitiveness and of allowing us to avoid excessive use of subscripts and superscripts. 

Let $T_1$ and $T_2$ be any two rooted (respectively, unrooted) phylogenies over taxon set $[n]$. Define
the following five sets of triplets (quartets) over $[n]$.

$\mathcal{S}(T_1, T_2)$: The set of all triplets (quartets) $X$ such that $T_1|X$ and $T_2|X$ are fully resolved, and
$T_1|X = T_2|X$.

$\mathcal{D}(T_1, T_2)$: The set of all triplets (quartets) $X$ such that $T_1|X$ and $T_2|X$ are fully resolved, and
$T_1|X \neq T_2|X$.

$\mathcal{R}_1(T_1, T_2)$: The set of all triplets (quartets) $X$ such that $T_1|X$ is fully resolved, but $T_2|X$ is not.

$\mathcal{R}_2(T_1, T_2)$: The set of all triplets (quartets) $X$ such that $T_2|X$ is fully resolved, but $T_1|X$ is not.

$\mathcal{U}(T_1, T_2)$: The set of all triplets (quartets) $X$ such that $T_1|X$ and $T_2|X$ are unresolved.
Let $p$ be a real number in the interval $[0, 1]$. The parametric triplet (quartet) distance between
$T_1$ and $T_2$ is defined as⁵

$$d^{(p)}(T_1, T_2) = |\mathcal{D}(T_1, T_2)| + p \left( |\mathcal{R}_1(T_1, T_2)| + |\mathcal{R}_2(T_1, T_2)| \right). \tag{1}$$

When the domain of $d^{(p)}$ is restricted to fully resolved trees, and thus
$\mathcal{R}_1(T_1, T_2) = \mathcal{R}_2(T_1, T_2) = \mathcal{U}(T_1, T_2) = \emptyset$, we refer to it simply as the triplet (quartet) distance.
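Definition (1) can be evaluated directly by enumerating all triplets, which is the cubic-time baseline and not the faster algorithm developed later in the paper. The sketch below assumes a nested-tuple encoding of rooted trees that is not part of the paper; the function names are illustrative only.

```python
from itertools import combinations


def leaf_set(tree):
    """Set of leaf labels of a nested-tuple tree (assumed encoding)."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def topology(tree, triplet):
    """Cherry {b, c} of the induced triplet tree a|bc, or None if unresolved."""
    wanted = frozenset(triplet)
    node = tree
    while True:
        groups = [(c, leaf_set(c) & wanted) for c in node]
        groups = [(c, g) for c, g in groups if g]
        if len(groups) == 1:
            node = groups[0][0]          # all three leaves below one child
        elif len(groups) == 2:
            return frozenset(max((g for _, g in groups), key=len))
        else:
            return None                  # three-way split: unresolved


def parametric_triplet_distance(t1, t2, p):
    """Brute-force evaluation of d^(p) = |D| + p * (|R1| + |R2|)."""
    assert leaf_set(t1) == leaf_set(t2), "trees must share one taxon set"
    differ = 0      # |D|: resolved in both trees, but differently
    one_sided = 0   # |R1| + |R2|: resolved in exactly one tree
    for triplet in combinations(sorted(leaf_set(t1)), 3):
        x, y = topology(t1, triplet), topology(t2, triplet)
        if x is not None and y is not None:
            differ += x != y
        elif (x is None) != (y is None):
            one_sided += 1
    return differ + p * one_sided


caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
fan = ("a", "b", "c", "d")
```

Here `caterpillar` and `balanced` disagree on the triplets {a, c, d} and {b, c, d}, so their distance is 2 for every $p$, whereas the distance from either tree to `fan` is $4p$ because all four triplets are resolved on one side only.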

Parameter $p$ allows one to make a smooth transition from soft to hard views of polytomy: When
$p = 0$, resolved triplets (quartets) are treated as equal to unresolved ones, while when $p = 1$, they
are treated as being completely different. Choosing intermediate values of $p$ allows one to adjust
for the amount of evidence required to resolve a polytomy.⁶

An alternative distance measure (inspired by References [19, 14]) is the Hausdorff distance,
defined as follows. Let $d$ be a metric over fully resolved trees. Metric $d$ is extended to partially
resolved trees as follows.

$$d_{\mathrm{Haus}}(T_1, T_2) = \max \left\{ \max_{t_1 \in \mathcal{F}(T_1)} \min_{t_2 \in \mathcal{F}(T_2)} d(t_1, t_2),\ \max_{t_2 \in \mathcal{F}(T_2)} \min_{t_1 \in \mathcal{F}(T_1)} d(t_1, t_2) \right\} \tag{2}$$

When $d$ is the triplet (quartet) distance, $d_{\mathrm{Haus}}$ is called the Hausdorff triplet (quartet) distance.

Definition (2) requires some explanation. The quantity $\min_{t_2 \in \mathcal{F}(T_2)} d(t_1, t_2)$ is the distance
between $t_1$ and the set of full refinements of $T_2$. Hence,

$$\max_{t_1 \in \mathcal{F}(T_1)} \min_{t_2 \in \mathcal{F}(T_2)} d(t_1, t_2)$$

is the maximum distance between a full refinement of $T_1$ and the set of full refinements of $T_2$.
Similarly,

$$\max_{t_2 \in \mathcal{F}(T_2)} \min_{t_1 \in \mathcal{F}(T_1)} d(t_1, t_2)$$

is the maximum distance between a full refinement of $T_2$ and the set of full refinements of $T_1$.
Therefore, $T_1$ and $T_2$ are at Hausdorff distance $r$ of each other if every full refinement of $T_1$ is
within distance $r$ of a full refinement of $T_2$ and vice-versa.
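As an illustration of Definition (2), the following sketch computes the Hausdorff distance between two finite sets that are given explicitly. It assumes the sets of full refinements have already been enumerated, which in general takes exponential time, so this is only a way to check small cases by hand and not an efficient method. Writing triplet trees as strings is a convention chosen here for illustration.

```python
def hausdorff(A, B, d):
    """Hausdorff distance between finite, non-empty collections A and B under d."""
    # Largest distance from an element of A to its nearest element of B.
    forward = max(min(d(a, b) for b in B) for a in A)
    # Largest distance from an element of B to its nearest element of A.
    backward = max(min(d(a, b) for a in A) for b in B)
    return max(forward, backward)


def triplet_distance(t1, t2):
    """Triplet distance between two fully resolved triplet trees: 0 or 1."""
    return 0 if t1 == t2 else 1


# Full refinements of the fan on {a, b, c} and of the resolved tree ab|c.
fan_refinements = ["ab|c", "ac|b", "bc|a"]
resolved_refinements = ["ab|c"]

fan_vs_resolved = hausdorff(fan_refinements, resolved_refinements, triplet_distance)
```

In this small case the fan has a refinement (for instance ac|b) at distance 1 from the only refinement of the resolved tree, so the Hausdorff triplet distance is 1.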



Aggregating phylogenies. Let $k$ be a positive integer and $S$ be a set of taxa. A profile of length
$k$ (or simply a profile, when $k$ is understood from the context) is a mapping $\mathcal{P}$ that assigns each
$i \in [k]$ a phylogenetic tree $\mathcal{P}(i)$ over $S$. We refer to these trees as input trees. A consensus rule is
a function that maps a profile $\mathcal{P}$ to some phylogenetic tree $T$ over $S$ called a consensus tree.

Let $d$ be a distance measure whose domain is the set of phylogenies over $S$. We extend $d$ to
define a distance measure from profiles to phylogenies as $d(T, \mathcal{P}) = \sum_{i=1}^{k} d(T, \mathcal{P}(i))$. A consensus
rule is a median rule for $d$ if for every profile $\mathcal{P}$ it returns a phylogeny $T^*$ of minimum distance to
$\mathcal{P}$; such a $T^*$ is called a median. The problem of finding a median for a profile with respect to a
distance measure $d$ is referred to as the median problem (relative to $d$), or as the aggregation problem.

⁵Note that the sets $\mathcal{S}(T_1, T_2)$ and $\mathcal{U}(T_1, T_2)$ are not used in the definition of $d^{(p)}$, but are needed for other purposes.
⁶We note that parametric triplet/quartet distance is a profile-based metric, in the sense of [19]. However, the use of
the word "profile" in [19] is quite different from our use of the term.
4 Expected parametric triplet and quartet distances

We now consider the expected value of parametric triplet and quartet distances. Let $u(n)$ and $r(n)$
denote the probabilities that a given quartet is, respectively, unresolved or resolved in an unrooted
phylogeny chosen uniformly at random from $P(n)$; thus, $u(n) = 1 - r(n)$. The following are the
two main results of this section.

Theorem 4.1. Let $T_1$ and $T_2$ be two unrooted phylogenies chosen uniformly at random with
replacement from $P(n)$. Then,

$$E(d^{(p)}(T_1, T_2)) = \binom{n}{4} \cdot \left( \frac{2}{3} \cdot r(n)^2 + 2 \cdot p \cdot r(n) \cdot u(n) \right). \tag{3}$$

Theorem 4.2. Let $T_1$ and $T_2$ be two rooted phylogenies chosen uniformly at random with
replacement from $RP(n)$. Then,

$$E(d^{(p)}(T_1, T_2)) = \binom{n}{3} \cdot \left( \frac{2}{3} \cdot r(n+1)^2 + 2 \cdot p \cdot r(n+1) \cdot u(n+1) \right). \tag{4}$$

It is known [33, 32] that

$$u(n) \sim \sqrt{\frac{\pi (2 \ln 2 - 1)}{4n}}. \tag{5}$$

Together with Theorems 4.1 and 4.2, this implies that $E(d^{(p)}(T_1, T_2))$ is asymptotically $\frac{2}{3} \binom{n}{4}$ for
unrooted trees and $\frac{2}{3} \binom{n}{3}$ for rooted trees.

The proof of Theorem 4.1 follows directly from the work of Day [15]; hence, it is omitted
(however, we should note that the proof is similar to that of Lemma 4.1 below). In the remainder
of this section, we give a proof of Theorem 4.2.

We need some notation. Let $u'(n)$ and $r'(n)$ denote the probabilities that a given triplet is,
respectively, unresolved or resolved in a rooted phylogeny chosen uniformly at random from $RP(n)$.

Lemma 4.1. Let $T_1$ and $T_2$ be two rooted phylogenies chosen uniformly at random with
replacement from $RP(n)$. Then,

$$E(d^{(p)}(T_1, T_2)) = \binom{n}{3} \cdot \left( \frac{2}{3} \cdot r'(n)^2 + 2 \cdot p \cdot r'(n) \cdot u'(n) \right). \tag{6}$$

Proof. By the definition of $d^{(p)}$ and the linearity of expectation, it suffices to establish the equalities
below.

$$E(|\mathcal{D}(T_1, T_2)|) = \binom{n}{3} \cdot \frac{2}{3} \cdot r'(n)^2 \tag{7}$$

$$E(|\mathcal{R}_1(T_1, T_2)|) = E(|\mathcal{R}_2(T_1, T_2)|) = \binom{n}{3} \cdot r'(n) \cdot u'(n) \tag{8}$$

To establish Equation (7), consider a triplet $X$. The probability that $X$ is resolved in $T_1$ (or $T_2$)
is $r'(n)$. Since the two trees are chosen independently, the probability that $X$ is resolved in both
$T_1$ and $T_2$ is $r'(n)^2$. There are exactly three different ways in which any given triplet can be
resolved. Hence, if $X$ is resolved in both $T_1$ and $T_2$, the probability that it is resolved differently
in the two trees is $\frac{2}{3}$. Thus, the probability of a given triplet being resolved in both $T_1$ and $T_2$,
but with different types in each, is $\frac{2}{3} r'(n)^2$. By the linearity of expectation and since the total
number of triplets from $\mathcal{L}(T_1)$ (and $\mathcal{L}(T_2)$) is $\binom{n}{3}$,

$$E(|\mathcal{D}(T_1, T_2)|) = \binom{n}{3} \cdot \frac{2}{3} \cdot r'(n)^2.$$

To establish Equation (8), we only need to study $E(|\mathcal{R}_1(T_1, T_2)|)$; the expression for $E(|\mathcal{R}_2(T_1, T_2)|)$
follows by symmetry. Consider a triplet $X$. The probability that $X$ is resolved in $T_1$ is $r'(n)$ and
the probability that $X$ is unresolved in $T_2$ is $u'(n)$. The expression for $E(|\mathcal{R}_1(T_1, T_2)|)$ now follows by
independence of the two trees and linearity of expectation. □

Let us define the function Add-Leaf : $RP(n) \to P(n + 1)$ as follows. Given a rooted tree
$T \in RP(n)$, Add-Leaf($T$) is the unrooted tree constructed from $T$ by (1) adding a leaf node
labeled $n + 1$ to $T$ by adjoining it to the root node of $T$ and (2) unrooting the resulting tree. The
next two lemmas are well known (for proofs, see [33, 22] and [30, p. 20], respectively).



Lemma 4.2. For all $n > 1$, $|RP(n)| = |P(n + 1)|$.

Lemma 4.3. Function Add-Leaf is a bijection from the set $RP(n)$ to the set $P(n + 1)$.

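For any triplet $X$ over $[n]$, we define two functions $g_X : RP(n) \to \{0, 1\}$ and $f_X : P(n + 1) \to
\{0, 1\}$ as follows:

$$g_X(T) = \begin{cases} 1 & \text{if triplet } X \text{ is resolved in tree } T \\ 0 & \text{otherwise} \end{cases}$$

$$f_X(T) = \begin{cases} 1 & \text{if quartet } X \cup \{n + 1\} \text{ is resolved in tree } T \\ 0 & \text{otherwise} \end{cases}$$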

We have the following result. 

Lemma 4.4. Let $X$ be any triplet over $[n]$. Consider a tree $T \in RP(n)$, and let $T'$ = Add-Leaf($T$).
Then, $f_X(T') = g_X(T)$.

Proof. Follows from the observation that triplet $X$ is resolved in $T$ if and only if quartet $X \cup \{n + 1\}$
is resolved in $T'$. □

Lemma 4.5. For all $n > 1$, $r'(n) = r(n + 1)$ and $u'(n) = u(n + 1)$.

Proof. Let $X$ be any triplet over $[n]$. By definition, $r(n + 1)$ is the probability of any given quartet
being resolved in a random unrooted tree in $P(n + 1)$. In particular, $r(n + 1)$ is the probability that
quartet $X \cup \{n + 1\}$ is resolved in a random unrooted tree. Now,

$$r(n + 1) = \sum_{T \in P(n+1)} \frac{f_X(T)}{|P(n + 1)|} = \sum_{T \in P(n+1)} \frac{f_X(T)}{|RP(n)|} = \sum_{T' \in RP(n)} \frac{g_X(T')}{|RP(n)|} = r'(n),$$

where the first and last equalities follow from the definitions of $r(n + 1)$ and $r'(n)$, respectively, the
second equality follows from Lemma 4.2, and the third follows from Lemma 4.3 and Lemma 4.4.
Since $u'(n) = 1 - r'(n)$ and $u(n + 1) = 1 - r(n + 1)$, it follows that $u'(n) = u(n + 1)$. □

Proof of Theorem 4.2. Simply substitute the expressions for $r'(n)$ and $u'(n)$ given in Lemma 4.5
into the expression for $E(d^{(p)}(T_1, T_2))$ given in Lemma 4.1. □



5 Properties of parametric distance 

In what follows, unless mentioned explicitly, whenever we refer to parametric distance, we mean 
both its triplet and quartet varieties. We begin with a useful observation. 

Proposition 5.1. For every $p, q$ such that $p, q \in (0, 1]$, $d^{(p)}$ and $d^{(q)}$ are equivalent.

Proof. Let $T_1$ and $T_2$ be two rooted (unrooted) trees. Let $M$ be the number of triplets (quartets)
resolved differently in $T_1$ and $T_2$, and let $N$ be the number of triplets (quartets) resolved only in one
of $T_1$ and $T_2$. Then, $d^{(p)}(T_1, T_2) = M + pN$ and $d^{(q)}(T_1, T_2) = M + qN$. Without loss of generality, let
$p \geq q$. Now, if $c_1 = q/p$, then we have $c_1 d^{(q)}(T_1, T_2) = qM/p + q^2 N/p \leq M + pN = d^{(p)}(T_1, T_2)$.
Similarly, if $c_2 = p/q$, then we have $c_2 d^{(q)}(T_1, T_2) = pM/q + pN \geq M + pN = d^{(p)}(T_1, T_2)$. Thus,
$c_1 d^{(q)}(T_1, T_2) \leq d^{(p)}(T_1, T_2) \leq c_2 d^{(q)}(T_1, T_2)$, and, consequently, $d^{(p)}$ and $d^{(q)}$ are equivalent. □

The next result precisely characterizes the ranges of $p$ for which $d^{(p)}$ is a metric or near-metric:

Theorem 5.1.

(i) For $p = 0$, $d^{(p)}$ is not a distance measure.

(ii) For $p \in (0, 1/2)$, $d^{(p)}$ is a distance measure, but not a metric.

(iii) For $p \in [1/2, 1]$, $d^{(p)}$ is a metric.

(iv) For $p \in (0, 1/2)$, $d^{(p)}$ is a near-metric.
Proof. Our proof is analogous to the proof of the corresponding result for partial rankings given
by Fagin et al. [19]. For the sake of completeness, we prove this result formally. For concreteness,
we state our arguments in terms of rooted trees and triplets. The extension to unrooted trees and
quartets is direct.

For the proof of (i) and (ii), we use the same three triplet trees, $t_1 = ab|c$, $t_2 = abc$ (i.e., a
completely unresolved tree), and $t_3 = ac|b$. To prove (i), we note that $d^{(0)}(t_1, t_2) = 0$, even though
$t_1 \neq t_2$. Thus $d^{(0)}$ is not a distance measure. Observe also that $d^{(0)}$ violates the triangle inequality,
since $d^{(0)}(t_1, t_2) + d^{(0)}(t_2, t_3) = 2p = 0 < 1 = d^{(0)}(t_1, t_3)$.

To prove (ii), observe that $d^{(p)}(t_1, t_2) = d^{(p)}(t_2, t_3) = p$, and $d^{(p)}(t_1, t_3) = 1$. Thus, $d^{(p)}(t_1, t_3) =
1 > 2p = d^{(p)}(t_1, t_2) + d^{(p)}(t_2, t_3)$, violating the triangle inequality. Thus, $d^{(p)}$ is not a metric
in this case. On the other hand, it is straightforward to verify that for any $p \in (0, 1/2)$
(as well, indeed, as for any $p \in [1/2, 1]$) and any trees $T_1$ and $T_2$, we have $d^{(p)}(T_1, T_2) \geq 0$,
$d^{(p)}(T_1, T_2) = d^{(p)}(T_2, T_1)$, and $d^{(p)}(T_1, T_2) = 0$ if and only if $T_1 = T_2$. Thus, $d^{(p)}$ is a distance
measure in this case.

We now prove (iii). As mentioned in the proof of part (ii), $d^{(p)}$ is a distance measure for
$p \in [1/2, 1]$. To complete the proof, we show that the triangle inequality holds; i.e., $d^{(p)}(T_1, T_3) \leq
d^{(p)}(T_1, T_2) + d^{(p)}(T_2, T_3)$ for any three trees $T_1, T_2, T_3$. Note that for any $i, j \in \{1, 2, 3\}$, we can
express $d^{(p)}(T_i, T_j)$ as

$$d^{(p)}(T_i, T_j) = \sum_{\{a,b,c\} \subseteq [n]} d^{(p)}(T_i|\{a, b, c\},\ T_j|\{a, b, c\}).$$

That is, the distance between $T_i$ and $T_j$ can be expressed as the sum of parametric distances
between all possible triplet trees induced by $T_i$ and $T_j$. For any $\{a, b, c\} \subseteq [n]$, and each $i \in
\{1, 2, 3\}$, let $t_i = T_i|\{a, b, c\}$. To complete the proof of (iii), it suffices to show that $d^{(p)}(t_1, t_3) \leq
d^{(p)}(t_1, t_2) + d^{(p)}(t_2, t_3)$. If $t_1 = t_3$, then $d^{(p)}(t_1, t_3) = 0 \leq d^{(p)}(t_1, t_2) + d^{(p)}(t_2, t_3)$, since
distances are nonnegative. If $t_1 \neq t_3$ and $t_2$ coincides with $t_1$ or $t_3$, the inequality holds with
equality. Otherwise, $t_2$ differs from both $t_1$ and $t_3$, so $d^{(p)}(t_1, t_3) \leq 1$, while
$d^{(p)}(t_1, t_2) + d^{(p)}(t_2, t_3) \geq 2p$. Thus,
$d^{(p)}(t_1, t_3) \leq d^{(p)}(t_1, t_2) + d^{(p)}(t_2, t_3)$ if $p \in [1/2, 1]$.

Finally, we prove (iv). By Proposition 5.1, for every $p \in (0, 1/2)$, $d^{(p)}$ is equivalent to $d^{(1/2)}$,
which, by part (iii), is a metric. The claim now follows from a result by Fagin et al. [20] that
implies that a distance measure is a near-metric if and only if it is equivalent to a metric. □
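The threshold at $p = 1/2$ can be checked mechanically on three-leaf trees. The sketch below is only a sanity check on the four phylogenies over {a, b, c} and does not replace the proof. The string encoding of topologies and the use of `None` for the fan $abc$ are conventions assumed here for illustration.

```python
from itertools import product

# The three resolved triplet trees on {a, b, c}; None stands for the fan abc.
TOPOLOGIES = ["ab|c", "ac|b", "bc|a", None]


def d_p(s, t, p):
    """Parametric distance between two phylogenies on the same three leaves."""
    if s == t:
        return 0.0
    if s is None or t is None:
        return p        # resolved in exactly one of the two trees
    return 1.0          # resolved in both trees, but differently


def triangle_inequality_holds(p):
    """Check d(x, z) <= d(x, y) + d(y, z) over all ordered triples of topologies."""
    return all(
        d_p(x, z, p) <= d_p(x, y, p) + d_p(y, z, p)
        for x, y, z in product(TOPOLOGIES, repeat=3)
    )
```

For $p = 0.25$ the check fails on the triple $(ab|c,\ abc,\ ac|b)$ used in the proof of (ii), while for $p = 0.5$ and $p = 1$ it succeeds.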

Part (iii) of Theorem 5.1 leads directly to approximation algorithms: Let $\mathcal{P}$ be a profile, let $T^*$ be the median tree for $\mathcal{P}$, and let $T = \mathcal{P}(i)$, where $i = \arg\min_j d(\mathcal{P}(j), \mathcal{P})$. Then, by a standard approximation bound argument (e.g., like those found in [36]), we have that $d(T, \mathcal{P}) \le 2\,d(T^*, \mathcal{P})$. Part (iv) indicates that the measure degrades nicely, since, along with the 2-approximation algorithm for $p \in [1/2, 1]$ implied by (iii), it leads to constant factor approximation algorithms for $p \in (0, 1/2)$ (an analogous observation for aggregation of partial rankings is made in [19]).
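The selection rule above is easy to state in code. The following is a minimal sketch, not the authors' implementation: `dist` stands for any metric on trees (for instance $d^{(p)}$ with $p \in [1/2, 1]$), and the names `total_distance` and `best_input_tree` are hypothetical. The toy usage replaces trees by integers, with absolute difference as the metric, purely to exercise the rule.

```python
def total_distance(tree, profile, dist):
    """Sum of the distances from `tree` to every member of the profile."""
    return sum(dist(tree, other) for other in profile)


def best_input_tree(profile, dist):
    """Return the profile member with the smallest total distance to the profile.

    If `dist` satisfies the triangle inequality, the total distance of this
    member is at most twice that of an optimal (median) tree.
    """
    return min(profile, key=lambda t: total_distance(t, profile, dist))


# Toy stand-in (an assumption for illustration): integers play the role of
# trees and absolute difference plays the role of the metric.
toy_profile = [0, 1, 10]
toy_dist = lambda a, b: abs(a - b)
best = best_input_tree(toy_profile, toy_dist)
```

The rule needs only pairwise distances among the input trees, so its cost is a quadratic number of distance evaluations in the profile length.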

The next result establishes a threshold for $p$ beyond which a collection of fully resolved trees gives enough evidence to produce a fully resolved tree, despite the disagreements among them.

Theorem 5.2. Let $\mathcal{P}$ be a profile of length $k$, such that for all $i \in [k]$, tree $\mathcal{P}(i)$ is fully resolved. Then, if $p \ge 2/3$, there exists a median tree $T$ for $\mathcal{P}$ relative to $d^{(p)}$ such that $T$ is fully resolved.

It is interesting to compare Theorem 5.2 with analogous results for partial rankings. Consider the variation of Kendall's tau for partial rankings in which a pair of items that is ordered in one ranking but in the same bucket in the other contributes $p$ to the distance, where $p \in [0, 1]$. This distance measure is a metric when $p \ge 1/2$ [19]. Furthermore, if $p \ge 1/2$ the median ranking relative to this distance (that is, the one that minimizes the total distance to the input rankings) is a full ranking if the input consists of full rankings [5]. In contrast, Proposition 5.1 and Theorem 5.2 show that, in the range $p \in [1/2, 2/3)$, parametric triplet and quartet distances are metrics, but the median tree is not guaranteed to be fully resolved even if the input trees are. The intuitive reason is that for rankings, there are only two possible outcomes for a comparison between two elements, but there are three ways in which a triplet or quartet may be resolved. This opens up a potentially useful range of values for $p$ wherein parametric triplet/quartet distance is a metric, but where one can adjust for the degree of evidence (or confidence) needed to resolve a node.

Our proof of Theorem 5.2 relies on two lemmas, which make use of the two procedures below.

Pull-Out($T$, $u$): The arguments are a rooted phylogenetic tree $T$ and a non-root node $u$ in $T$, whose parent, denoted by $v$, has 3 or more children. The procedure returns a new tree $T'$ obtained from $T$ as follows. Split $v$ into two nodes $v'$ and $v''$ such that the parent of $v'$ equals the parent of $v$, the children of $v'$ are $u$ and $v''$, and the children of $v''$ are all the children of $v$ except for $u$.

Pull-2-Out($T$, $u_1$, $u_2$): The arguments are an unrooted phylogenetic tree $T$ and two nodes $u_1$, $u_2$ sharing the same neighbor $v$ whose degree is at least 4 in $T$. The procedure returns a new tree $T'$ obtained from $T$ as follows. Split $v$ into two nodes $v'$ and $v''$ such that the neighbors of $v'$ are $v''$, $u_1$, and $u_2$, and the neighbors of $v''$ are $v'$ and the neighbors of $v$ except for $u_1$ and $u_2$.
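As an illustration, Pull-Out can be sketched on a rooted tree stored as a child-list dictionary. The representation (a `children` dict for internal nodes plus a `parent` dict) and the names `pull_out` and `new_name` are assumptions made for this example only. The existing node $v$ is reused in the role of $v'$, and `new_name` labels the new node $v''$.

```python
def pull_out(children, parent, u, new_name):
    """Return (children, parent) dictionaries describing Pull-Out(T, u).

    `children` maps each internal node to the list of its children and
    `parent` maps each non-root node to its parent. The inputs are not
    modified.
    """
    v = parent[u]
    if len(children[v]) < 3:
        raise ValueError("the parent of u must have at least 3 children")
    new_children = {node: list(kids) for node, kids in children.items()}
    new_parent = dict(parent)
    # v'' receives every child of v except u.
    new_children[new_name] = [c for c in children[v] if c != u]
    for c in new_children[new_name]:
        new_parent[c] = new_name
    # v (acting as v') keeps its own parent; its children are u and v''.
    new_children[v] = [u, new_name]
    new_parent[new_name] = v
    return new_children, new_parent


# A root "r" with three leaf children; pull out leaf "a".
star_children = {"r": ["a", "b", "c"]}
star_parent = {"a": "r", "b": "r", "c": "r"}
resolved_children, resolved_parent = pull_out(star_children, star_parent, "a", "w")
```

In the usage above the star tree on {a, b, c} becomes the resolved triplet tree a|bc. Pull-2-Out could be sketched the same way on an adjacency-list representation.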

In what follows, we write $T_i$ to denote $\mathcal{P}(i)$, the $i$-th tree in profile $\mathcal{P}$, for $i \in [k]$. We need to introduce separate but analogous concepts for rooted and unrooted trees.

Suppose $T$ is a rooted phylogenetic tree and let $v$ be any node in $T$ with at least 3 children, denoted $u_1, u_2, \ldots, u_d$. For $q \in [d]$, let $T^{(q)}$ = Pull-Out($T$, $u_q$) and let $L_q$ denote the set of triplets $X$ such that $T|X$ is not fully resolved but $T^{(q)}|X$ is fully resolved. Define the following two quantities.

$$f_q = \sum_{X \in L_q} \bigl|\{i \in [k] : T_i|X \text{ agrees with } T^{(q)}|X\}\bigr| \tag{11}$$

$$a_q = \sum_{X \in L_q} \bigl|\{i \in [k] : T_i|X \text{ disagrees with } T^{(q)}|X\}\bigr| \tag{12}$$

Informally, $f_q$ and $a_q$ are the number of votes cast by the trees in profile $\mathcal{P}$ for and against the way the triplets in $L_q$ are resolved in $T^{(q)}$. Indeed, note that, by assumption, every tree in profile $\mathcal{P}$ is fully resolved. Thus, for each triplet $X = \{x, y, z\}$ and every $i \in [k]$, $T_i|X$ must agree with exactly one of $x|yz$, $y|xz$, or $z|xy$. Thus, there are $k$ votes associated with each triplet $X$, some for, some against.

Now suppose $T$ is an unrooted phylogenetic tree. Let $v$ be any node in phylogeny $T$ and let $u_1, u_2, \ldots, u_d$ be the neighbors of $v$. For $q, r \in [d]$, $q \ne r$, let $T^{(qr)}$ = Pull-2-Out($T$, $u_q$, $u_r$) and let $L_{qr}$ denote the set of quartets $X$ such that $T|X$ is not fully resolved but $T^{(qr)}|X$ is fully resolved. Define the following two quantities.

$$f_{qr} = \sum_{X \in L_{qr}} \bigl|\{i \in [k] : T_i|X \text{ agrees with } T^{(qr)}|X\}\bigr| \tag{13}$$

$$a_{qr} = \sum_{X \in L_{qr}} \bigl|\{i \in [k] : T_i|X \text{ disagrees with } T^{(qr)}|X\}\bigr| \tag{14}$$

We have the following result. 

Lemma 5.1. For the rooted case, there exists an index $q \in [d]$ such that $f_q \ge a_q/2$. For the unrooted case, there exist two indices $q, r \in [d]$, $q \ne r$, such that $f_{qr} \ge a_{qr}/2$.

Proof. For the rooted case, let $L = \bigcup_{q=1}^{d} L_q$. Thus, $L$ consists of those triplets that are unresolved in $T$, but resolved in $T^{(q)}$, for some $q \in [d]$. Equivalently, $L$ consists of those triplets whose elements are leaves from three different subtrees of $v$.

Let $X = \{x, y, z\}$ be a triplet in $L$. Assume that $x \in \mathcal{L}(T(u_q))$, $y \in \mathcal{L}(T(u_r))$, and $z \in \mathcal{L}(T(u_s))$, where $q, r, s$ must be distinct indices in $[d]$. Then, $X$ is in $L_q$, $L_r$, and $L_s$.

Consider any $i \in [k]$. By assumption, $T_i|X$ is a fully resolved triplet tree. Assume without loss of generality that $T_i|X = x|yz$. Then, $T^{(q)}|X$ agrees with $T_i|X$, so $T_i|X$ contributes $+1$ to $f_q$. On the other hand, both $T^{(r)}|X$ and $T^{(s)}|X$ disagree with $T_i|X$, so $T_i|X$ contributes $+1$ to $a_r$ and $+1$ to $a_s$. Furthermore, for any $t \notin \{q, r, s\}$, $T_i|X$ contributes nothing to $f_t$ or $a_t$, since the triplet tree $T^{(t)}|X$ is not fully resolved. Therefore, we have the following equalities.

$$\sum_{q=1}^{d} a_q = 2k \cdot |L| \tag{15}$$

$$\sum_{q=1}^{d} f_q = k \cdot |L| \tag{16}$$

Now suppose that for all $q \in [d]$, $f_q < a_q/2$. This yields the following contradiction:

$$k \cdot |L| = \sum_{q=1}^{d} f_q < \frac{1}{2} \sum_{q=1}^{d} a_q = k \cdot |L|.$$

Here, the first equality follows from Equation (16) and the last equality follows from Equation (15). Thus, there must be some $q \in [d]$ such that $f_q \ge a_q/2$.

Similarly, for the unrooted case, let $L = \bigcup_{q,r \in [d],\, q \ne r} L_{qr}$. Thus, $L$ consists of those quartets that are unresolved in $T$, but resolved in $T^{(qr)}$ for some $q, r \in [d]$, $q \ne r$. Equivalently, $L$ consists of those quartets whose elements are leaves from four different neighboring subtrees of $v$.

Let $X = \{w, x, y, z\}$ be a quartet in $L$. Assume that $w \in \mathcal{L}(T(u_q, v))$, $x \in \mathcal{L}(T(u_r, v))$, $y \in \mathcal{L}(T(u_s, v))$, and $z \in \mathcal{L}(T(u_t, v))$, where $q, r, s, t$ must be distinct indices in $[d]$. Then, $X$ is in $L_{qr}$, $L_{qs}$, $L_{qt}$, $L_{rs}$, $L_{rt}$, and $L_{st}$.

Consider any $i \in [k]$. By assumption, $T_i|X$ is a fully resolved quartet tree. Assume, without loss of generality, that $T_i|X = wx|yz$. Then, $T^{(qr)}|X$ and $T^{(st)}|X$ agree with $T_i|X$, so $T_i|X$ contributes $+1$ to $f_{qr}$ and $f_{st}$, respectively. This double contribution is due to the symmetry of quartets. On the other hand, $T^{(qs)}|X$, $T^{(qt)}|X$, $T^{(rs)}|X$, and $T^{(rt)}|X$ disagree with $T_i|X$, so $T_i|X$ contributes $+1$ to $a_{qs}$, $a_{qt}$, $a_{rs}$, and $a_{rt}$, respectively. Furthermore, if at least one of $t_1, t_2 \notin \{q, r, s, t\}$, then $T_i|X$ contributes nothing to $f_{t_1 t_2}$ or $a_{t_1 t_2}$, since the quartet tree $T^{(t_1 t_2)}|X$ is not fully resolved. Therefore, similar to the rooted case, we have the following equalities, where each sum counts every unordered pair $\{q, r\}$ once.

$$\sum_{q,r \in [d],\, q \ne r} a_{qr} = 4k \cdot |L| \tag{17}$$

$$\sum_{q,r \in [d],\, q \ne r} f_{qr} = 2k \cdot |L| \tag{18}$$

Now suppose that for all $q, r \in [d]$, $q \ne r$, $f_{qr} < a_{qr}/2$. This yields the following contradiction:

$$2k \cdot |L| = \sum_{q,r \in [d],\, q \ne r} f_{qr} < \frac{1}{2} \sum_{q,r \in [d],\, q \ne r} a_{qr} = 2k \cdot |L|.$$

Here, the first equality follows from Equation (18) and the last equality follows from Equation (17). Thus, there must be some $q, r \in [d]$, $q \ne r$, such that $f_{qr} \ge a_{qr}/2$. □

Lemma 5.2. Let $\mathcal{P}$ be a profile of length $k$ over $S$ consisting entirely of fully resolved rooted trees or fully resolved unrooted trees. Let $T$ be a phylogeny for $S$; $T$ is rooted or unrooted according to whether $\mathcal{P}$ consists of rooted or unrooted trees. Suppose $T$ contains an unresolved node $v$, and suppose $p \ge 2/3$. Then, the following holds.

(i) If $T$ is rooted, $v$ has a child $u$ such that $d^{(p)}(\hat{T}, \mathcal{P}) \le d^{(p)}(T, \mathcal{P})$, where $\hat{T}$ = Pull-Out($T$, $u$).

(ii) If $T$ is unrooted, $v$ has two neighbors $u_q$ and $u_r$ such that $d^{(p)}(\hat{T}, \mathcal{P}) \le d^{(p)}(T, \mathcal{P})$, where $\hat{T}$ = Pull-2-Out($T$, $u_q$, $u_r$).

Proof. We will show that in the rooted case, for all $q \in [d]$,

$$d^{(p)}(T^{(q)}, \mathcal{P}) = d^{(p)}(T, \mathcal{P}) - p \cdot f_q + (1 - p) \cdot a_q. \tag{19}$$

And, similarly, in the unrooted case, for all $q, r \in [d]$, $q \ne r$,

$$d^{(p)}(T^{(qr)}, \mathcal{P}) = d^{(p)}(T, \mathcal{P}) - p \cdot f_{qr} + (1 - p) \cdot a_{qr}. \tag{20}$$

To verify this, consider any triplet $X \in L_q$ or quartet $X \in L_{qr}$. For every $j$ such that $T^{(q)}|X$ or $T^{(qr)}|X$ is identical to $T_j|X$, the net change in the distance from $\mathcal{P}$ is $-p$, since, for this $X$, $T_j$ contributes $p$ to the distance to $T$, but contributes $0$ to the distance to $T^{(q)}$ or $T^{(qr)}$. For every $j$ such that $T^{(q)}|X$ or $T^{(qr)}|X$ is different from $T_j|X$, the net change in the distance from $\mathcal{P}$ is $1 - p$, since, for this $X$, $T_j$ contributes $p$ to the distance to $T$, but contributes $+1$ to the distance to $T^{(q)}$ or $T^{(qr)}$.

Now, for the rooted case, choose a $q^* \in [d]$ such that $f_{q^*} \ge a_{q^*}/2$; for the unrooted case, choose two indices $q^*, r^* \in [d]$, $q^* \ne r^*$, such that $f_{q^* r^*} \ge a_{q^* r^*}/2$. The existence of such a $q^*$ (or $q^*$ and $r^*$) is guaranteed by Lemma 5.1. Then, Equation (19) and $p \ge 2/3$ imply that $d^{(p)}(T^{(q^*)}, \mathcal{P}) \le d^{(p)}(T, \mathcal{P})$. Similarly, Equation (20) and $p \ge 2/3$ imply that $d^{(p)}(T^{(q^* r^*)}, \mathcal{P}) \le d^{(p)}(T, \mathcal{P})$. □
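The last step of this proof can be written out explicitly. The short derivation below uses only Equation (19) and the bound $f_{q^*} \ge a_{q^*}/2$ from Lemma 5.1, and it shows where the threshold 2/3 comes from; the unrooted case is identical with $f_{q^* r^*}$ and $a_{q^* r^*}$ in place of $f_{q^*}$ and $a_{q^*}$.

```latex
\begin{align*}
d^{(p)}(T^{(q^*)},\mathcal{P}) - d^{(p)}(T,\mathcal{P})
  &= -p\, f_{q^*} + (1-p)\, a_{q^*} \\
  &\le -\tfrac{p}{2}\, a_{q^*} + (1-p)\, a_{q^*} \\
  &= \bigl(1 - \tfrac{3p}{2}\bigr)\, a_{q^*} \;\le\; 0
  \qquad \text{whenever } p \ge 2/3 .
\end{align*}
```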

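Proof of Theorem 5.2. If $\mathcal{P}$ consists of only fully resolved trees, then any phylogeny $T$ can be transformed into a fully resolved tree $T'$ such that $d^{(p)}(T', \mathcal{P}) \le d^{(p)}(T, \mathcal{P})$ by doing the following. First, let $T' = T$. Next, while $T'$ contains an unresolved node, perform the following three steps:

1. Pick any unresolved node $v$ in $T'$.

2. If $T'$ is rooted, find a child $u$ of $v$ such that $d^{(p)}(\hat{T}, \mathcal{P}) \le d^{(p)}(T', \mathcal{P})$, where $\hat{T}$ = Pull-Out($T'$, $u$). If $T'$ is unrooted, find two neighbors $u_q$, $u_r$ of $v$ such that $d^{(p)}(\hat{T}, \mathcal{P}) \le d^{(p)}(T', \mathcal{P})$, where $\hat{T}$ = Pull-2-Out($T'$, $u_q$, $u_r$).

3. Replace $T'$ by $\hat{T}$.

Note that the existence of a node $u$ (or a pair $u_q$, $u_r$) such as the one required in Step 2 is guaranteed by Lemma 5.2. Each iteration adds one internal node, so the loop terminates. Applying the procedure to any median tree therefore yields a fully resolved tree whose distance to $\mathcal{P}$ is no larger. Thus, for $p \ge 2/3$, there always exists a fully resolved median tree relative to $d^{(p)}$. □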

The proof of Theorem 5.2 implies that if $p > 2/3$ and the input trees are fully resolved, the median tree relative to $d^{(p)}$ must be fully resolved. On the other hand, it is easy to show that when $p \in [1/2, 2/3)$, there are profiles of fully resolved trees whose median tree is only partially resolved.



6 Relationships among the metrics 

We do not know whether the Hausdorff triplet or Hausdorff quartet distances are computable in 
polynomial time. Indeed, we suspect that, unlike their counterparts for partial rankings, this may 
not be possible. On the positive side, we show here that, in a broad range of cases, it is possible 
to obtain an approximation to the Hausdorff distance by exploiting its connection with parametric 
distance. As in the previous section, our results apply to both triplet and quartet distances. Our 
first result, which is proved later in this section, is as follows. 

Lemma 6.1. For every two phylogenies $T_1$ and $T_2$ over the same set of taxa,

$$d_{Haus}(T_1,T_2) \ge |\mathcal{D}(T_1,T_2)| + \frac{2}{3} \cdot \max\{|\mathcal{R}_1(T_1,T_2)|,\, |\mathcal{R}_2(T_1,T_2)|\}.$$

An upper bound on $d_{Haus}$ is obtained by assuming that $T_1$ and $T_2$ are refined so that the triplets (quartets) in $\mathcal{R}_1(T_1,T_2)$, $\mathcal{R}_2(T_1,T_2)$, and $\mathcal{U}(T_1,T_2)$ are resolved differently in each refinement. This gives us the following result, which we state without proof.

Lemma 6.2. For every two phylogenies $T_1$ and $T_2$ over the same set of taxa,

$$d_{Haus}(T_1,T_2) \le |\mathcal{D}(T_1,T_2)| + |\mathcal{R}_1(T_1,T_2)| + |\mathcal{R}_2(T_1,T_2)| + |\mathcal{U}(T_1,T_2)|.$$
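To make the two bounds concrete, the sketch below evaluates them from the four set sizes. The function name `hausdorff_bounds` and the sample counts are hypothetical; the code only restates Lemmas 6.1 and 6.2 as arithmetic and does not compute the Hausdorff distance itself.

```python
from fractions import Fraction


def hausdorff_bounds(num_d, num_r1, num_r2, num_u):
    """Return (lower, upper) bounds on the Hausdorff distance.

    num_d, num_r1, num_r2, num_u are the sizes of the sets D, R1, R2 and U
    for a pair of trees. The lower bound is the one of Lemma 6.1 and the
    upper bound is the one of Lemma 6.2.
    """
    lower = num_d + Fraction(2, 3) * max(num_r1, num_r2)
    upper = num_d + num_r1 + num_r2 + num_u
    return lower, upper


# Hypothetical counts for a pair of trees on five taxa (10 triplets,
# 3 of them identically resolved in both trees).
lower, upper = hausdorff_bounds(num_d=2, num_r1=2, num_r2=1, num_u=2)
```

Exact rational arithmetic (`Fraction`) is used so that the factor 2/3 does not introduce rounding.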

It is instructive to compare Lemmas 6.1 and 6.2 with the situation for partial rankings. The Hausdorff version of Kendall's tau is obtained by viewing each partial ranking as the set of all possible full rankings that can be obtained by refining it (that is, ordering elements within buckets). The distance is then the Hausdorff distance between the two sets, where the distance between two elements is the Kendall tau score. Critchlow [14] has given exact bounds on this distance measure, which allow it to be computed efficiently and to establish an equivalence with the parametric version of Kendall's tau defined in Section 5 [19]. To be precise, let $L_1$ and $L_2$ be two partial rankings. Re-using notation, let $\mathcal{D}(L_1, L_2)$ be the set of all pairs that are ordered differently in $L_1$ and $L_2$, $\mathcal{R}_1(L_1, L_2)$ be the set of pairs that are ordered in $L_1$ but in the same bucket in $L_2$, and $\mathcal{R}_2(L_1, L_2)$ be the set of pairs that are ordered in $L_2$ but in the same bucket in $L_1$. Then, it can be shown that $d_{Haus}(L_1, L_2) = |\mathcal{D}(L_1, L_2)| + \max\{|\mathcal{R}_1(L_1, L_2)|,\, |\mathcal{R}_2(L_1, L_2)|\}$ (see the references above).
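The closed form for partial rankings translates directly into a short routine. The sketch below is illustrative only: it assumes each partial ranking is given as a dictionary from item to bucket index (smaller index means ranked earlier, equal index means same bucket), and the name `kendall_hausdorff` is hypothetical.

```python
from itertools import combinations


def kendall_hausdorff(rank1, rank2):
    """Hausdorff Kendall tau distance |D| + max(|R1|, |R2|) for bucket orders.

    rank1 and rank2 map the same items to bucket indices.
    """
    num_d = num_r1 = num_r2 = 0
    for x, y in combinations(sorted(rank1), 2):
        # Sign of the comparison in each ranking; 0 means "same bucket".
        o1 = (rank1[x] > rank1[y]) - (rank1[x] < rank1[y])
        o2 = (rank2[x] > rank2[y]) - (rank2[x] < rank2[y])
        if o1 != 0 and o2 != 0:
            if o1 != o2:
                num_d += 1   # ordered in both rankings, but differently
        elif o1 != 0:
            num_r1 += 1      # ordered only in the first ranking
        elif o2 != 0:
            num_r2 += 1      # ordered only in the second ranking
    return num_d + max(num_r1, num_r2)


full = {"a": 0, "b": 1, "c": 2}      # the full ranking a, b, c
one_bucket = {"a": 0, "b": 0, "c": 0}  # all three items tied
partial = {"a": 1, "b": 0, "c": 1}     # b first, then a and c tied
dist_to_tie = kendall_hausdorff(full, one_bucket)
dist_to_partial = kendall_hausdorff(full, partial)
```

In the first usage the farthest refinement of the single bucket is the reversed order c, b, a, which is at Kendall tau distance 3 from a, b, c, matching the formula.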

It seems unlikely that a similar simple expression can be obtained for Hausdorff triplet or quartet distance. There are at least two reasons for this. Let $L_1$ and $L_2$ be partial rankings. Then, it is possible to resolve $L_1$ so that it disagrees with $L_2$ in any pair in $\mathcal{R}_2(L_1, L_2)$. Similarly, there is a way to resolve $L_2$ so that it disagrees with $L_1$ in any pair in $\mathcal{R}_1(L_1, L_2)$. We have been unable to establish an analog of this property for trees; hence, the $\frac{2}{3}$ factor in Lemma 6.1. The second reason is due to the properties of the set $\mathcal{U}(L_1, L_2)$. It can be shown that one can refine rankings $L_1$ and $L_2$ in such a way that pairs of elements that are unresolved in both rankings are resolved the same way in the refinements. This seems impossible to do, in general, for trees and leads to the presence of $|\mathcal{U}(T_1, T_2)|$ in Lemma 6.2.

The above observations prevent us from establishing equivalence between $d_{Haus}$ and $d^{(p)}$, although they do not disprove equivalence either. In any event, the next result shows that when the number of triplets (quartets) that are unresolved in both trees is suitably small, equivalence does hold.

Theorem 6.1. Let $\beta$ be a positive real number. Then, for every $p \in (0, 1]$, Hausdorff distance and parametric distance are equivalent when restricted to pairs of trees $(T_1, T_2)$ such that $|\mathcal{U}(T_1, T_2)| \le \beta\,(|\mathcal{D}(T_1,T_2)| + |\mathcal{R}_1(T_1,T_2)| + |\mathcal{R}_2(T_1,T_2)|)$.

Proof. By Proposition 5.1, it suffices to show that $d_{Haus}$ is equivalent to $d^{(2/3)}$. Since $|\mathcal{R}_1(T_1,T_2)| + |\mathcal{R}_2(T_1,T_2)| \le 2\max\{|\mathcal{R}_1(T_1,T_2)|, |\mathcal{R}_2(T_1,T_2)|\}$, Lemma 6.1 implies that $d^{(2/3)}(T_1, T_2) \le 2\,d_{Haus}(T_1, T_2)$. Thus, we only need to show that, under our assumption about $|\mathcal{U}(T_1, T_2)|$, there is some $c$ such that $d_{Haus}(T_1, T_2) \le c \cdot d^{(2/3)}(T_1, T_2)$. The reader can verify that the result follows by choosing $c = 3 + 3\beta$ and invoking Lemma 6.2. □

The remainder of this section is devoted to the proof of Lemma 6.1. The argument proceeds in two steps. First, we show that $T_1$ can be refined so that it disagrees with $T_2$ in at least two thirds of the triplets (quartets) in $\mathcal{R}_2(T_1, T_2)$. Next, we show the existence of an analogous refinement of $T_2$. Note that the triplets (quartets) in $\mathcal{D}(T_1, T_2)$ are resolved differently in any refinements of $T_1$ and $T_2$. This gives lower bounds for both arguments in the outer max of the definition of $d_{Haus}(T_1, T_2)$ (Equation (2)) and yields the lemma.

Let $v$ be a node in $T_1$. If $T_1$ is rooted, then, as in Section 5, let $u_1, \ldots, u_d$ denote the children of $v$ in $T_1$ and $T_1^{(q)}$ denote Pull-Out($T_1$, $u_q$). Define $\mathcal{M}_q(v)$ to be the set of all triplets $X \in \mathcal{R}_2(T_1, T_2)$ such that (i) the lca of $X$ in $T_1$ is $v$ and (ii) $T_1|X$ is unresolved but $T_1^{(q)}|X$ is fully resolved. Let $\mathcal{M}(v) = \bigcup_{q=1}^{d} \mathcal{M}_q(v)$. Thus, $\mathcal{M}(v)$ is the set of triplets associated with $v$ that are resolved in $T_2$ but not in $T_1$.

If $T_1$ is unrooted, $u_1, \ldots, u_d$ denote the neighbors of $v$ in $T_1$ and $T_1^{(qr)}$ denotes Pull-2-Out($T_1$, $u_q$, $u_r$), where Pull-2-Out is the function defined in Section 5. Define $\mathcal{M}_{qr}(v)$ to be the set of all quartets $X \in \mathcal{R}_2(T_1, T_2)$ such that (i) $T_1|X$ is a fan, (ii) the paths between any two distinct pairs of taxa in $X$ meet at $v$, and (iii) $T_1|X$ is unresolved but $T_1^{(qr)}|X$ is fully resolved. Let $\mathcal{M}(v) = \bigcup_{q,r \in [d],\, q \ne r} \mathcal{M}_{qr}(v)$. Thus, $\mathcal{M}(v)$ is the set of quartets associated with $v$ that are resolved in $T_2$ but not in $T_1$.

Define the following two sets for the rooted case.

$$F_q = \{X \in \mathcal{M}_q(v) : T_2|X \text{ agrees with } T_1^{(q)}|X\} \tag{21}$$

$$A_q = \{X \in \mathcal{M}_q(v) : T_2|X \text{ disagrees with } T_1^{(q)}|X\} \tag{22}$$

Define the following two sets for the unrooted case.

$$F_{qr} = \{X \in \mathcal{M}_{qr}(v) : T_2|X \text{ agrees with } T_1^{(qr)}|X\} \tag{23}$$

$$A_{qr} = \{X \in \mathcal{M}_{qr}(v) : T_2|X \text{ disagrees with } T_1^{(qr)}|X\} \tag{24}$$

The next result is, in a sense, a counterpart to Lemma 5.1.

Lemma 6.3. For the rooted case, there exists an index $q \in [d]$ such that $|A_q| \ge 2|F_q|$. For the unrooted case, there exist two indices $q, r \in [d]$, $q \ne r$, such that $|A_{qr}| \ge 2|F_{qr}|$.

Proof. We start with the rooted case. Consider any triplet $X = \{x, y, z\}$ in $\mathcal{M}(v)$. Assume that $x \in \mathcal{L}(T_1(u_q))$, $y \in \mathcal{L}(T_1(u_r))$, and $z \in \mathcal{L}(T_1(u_s))$, where $q, r, s$ must be distinct indices in $[d]$. Thus, $X$ is in $\mathcal{M}_q(v)$, $\mathcal{M}_r(v)$, and $\mathcal{M}_s(v)$.

By definition of $\mathcal{M}(v)$, $T_2|X$ is a fully resolved triplet tree. Assume that $T_2|X = x|yz$. Then, $T_1^{(q)}|X$ agrees with $T_2|X$, so $X$ contributes exactly one element to $F_q$. On the other hand, both $T_1^{(r)}|X$ and $T_1^{(s)}|X$ disagree with $T_2|X$, so $X$ contributes exactly one element to $A_r$ and one element to $A_s$. Furthermore, for any $t \notin \{q, r, s\}$, $X$ contributes nothing to $F_t$ or $A_t$, since the triplet tree $T_1^{(t)}|X$ is not fully resolved. Therefore, we have that

$$\sum_{q=1}^{d} |A_q| = 2 \cdot |\mathcal{M}(v)| \quad \text{and} \quad \sum_{q=1}^{d} |F_q| = |\mathcal{M}(v)|. \tag{25}$$

Assume that for all $q \in [d]$, $|F_q| > |A_q|/2$. This and (25) imply that

$$|\mathcal{M}(v)| = \sum_{q=1}^{d} |F_q| > \frac{1}{2} \sum_{q=1}^{d} |A_q| = |\mathcal{M}(v)|,$$

a contradiction.

We now consider the unrooted case. Consider any quartet $X = \{w, x, y, z\}$ in $\mathcal{M}(v)$. Assume that $w \in \mathcal{L}(T_1(u_q, v))$, $x \in \mathcal{L}(T_1(u_r, v))$, $y \in \mathcal{L}(T_1(u_s, v))$, and $z \in \mathcal{L}(T_1(u_t, v))$, where $q, r, s, t$ must be distinct indices in $[d]$. Thus, $X$ is in $\mathcal{M}_{qr}(v)$, $\mathcal{M}_{qs}(v)$, $\mathcal{M}_{qt}(v)$, $\mathcal{M}_{rs}(v)$, $\mathcal{M}_{rt}(v)$ and $\mathcal{M}_{st}(v)$.

By definition of $\mathcal{M}(v)$, $T_2|X$ is a fully resolved quartet tree. Assume that $T_2|X = wx|yz$. Then, $T_1^{(qr)}|X$ and $T_1^{(st)}|X$ agree with $T_2|X$, so $X$ contributes exactly one element to $F_{qr}$ and $F_{st}$. On the other hand, $T_1^{(qs)}|X$, $T_1^{(qt)}|X$, $T_1^{(rs)}|X$ and $T_1^{(rt)}|X$ disagree with $T_2|X$, so $X$ contributes exactly one element to $A_{qs}$, $A_{qt}$, $A_{rs}$ and $A_{rt}$, respectively. Furthermore, for any $j_1$ and $j_2$ at least one of which is not in $\{q, r, s, t\}$, $X$ contributes nothing to $F_{j_1 j_2}$ or $A_{j_1 j_2}$, since the quartet tree $T_1^{(j_1 j_2)}|X$ is not fully resolved. Therefore, we have that

$$\sum_{q,r \in [d],\, q \ne r} |A_{qr}| = 4 \cdot |\mathcal{M}(v)| \quad \text{and} \quad \sum_{q,r \in [d],\, q \ne r} |F_{qr}| = 2 \cdot |\mathcal{M}(v)|. \tag{26}$$

Assume that for all $q, r \in [d]$, $q \ne r$, $|F_{qr}| > |A_{qr}|/2$. This and (26) imply that

$$2 \cdot |\mathcal{M}(v)| = \sum_{q,r \in [d],\, q \ne r} |F_{qr}| > \frac{1}{2} \sum_{q,r \in [d],\, q \ne r} |A_{qr}| = 2 \cdot |\mathcal{M}(v)|,$$

a contradiction. □
Proof of Lemma 6.1. Define the following functions. For any two phylogenies $T_1$, $T_2$ over $S$, let

$$d_{H1}(T_1,T_2) = \max_{t_1 \in \mathcal{F}(T_1)} \; \min_{t_2 \in \mathcal{F}(T_2)} d(t_1,t_2), \tag{27}$$

$$d_{H2}(T_1,T_2) = \max_{t_2 \in \mathcal{F}(T_2)} \; \min_{t_1 \in \mathcal{F}(T_1)} d(t_1,t_2). \tag{28}$$

We show that

$$d_{H1}(T_1,T_2) \ge |\mathcal{D}(T_1,T_2)| + \frac{2}{3} \cdot |\mathcal{R}_2(T_1,T_2)| \tag{29}$$

$$d_{H2}(T_1,T_2) \ge |\mathcal{D}(T_1,T_2)| + \frac{2}{3} \cdot |\mathcal{R}_1(T_1,T_2)|. \tag{30}$$

Since $d_{Haus}(T_1, T_2) = \max\{d_{H1}(T_1, T_2),\, d_{H2}(T_1, T_2)\}$, this proves Lemma 6.1.

By symmetry, it suffices to prove Inequality (29). Our argument relies on two observations. First, note that if $T_1'$ is a refinement of $T_1$ (but possibly not a full refinement), then $d_{H1}(T_1, T_2) \ge d_{H1}(T_1', T_2)$. This holds because $\mathcal{F}(T_1') \subseteq \mathcal{F}(T_1)$. Second, for any two phylogenies $T_1$ and $T_2$, $d_{H1}(T_1,T_2) \ge |\mathcal{D}(T_1,T_2)|$. This holds because for any $t_1 \in \mathcal{F}(T_1)$, $t_2 \in \mathcal{F}(T_2)$, we have that $\mathcal{D}(T_1,T_2) \subseteq \mathcal{D}(t_1,t_2)$, and (by definition) $d(t_1,t_2) = |\mathcal{D}(t_1,t_2)|$.

By the preceding observations, if we prove that it is possible to construct a refinement $T_1'$ of $T_1$ such that $|\mathcal{D}(T_1',T_2)| \ge |\mathcal{D}(T_1,T_2)| + \frac{2}{3}|\mathcal{R}_2(T_1, T_2)|$, then Inequality (29) follows. The idea is to find a refinement $T_1'$ of $T_1$ such that for at least two-thirds of the triplets or quartets $X \in \mathcal{R}_2(T_1, T_2)$, we have that $T_1'|X \ne T_2|X$. To obtain the desired refinement of $T_1$, we initially set $T_1' = T_1$ and then perform the following steps while they apply:

1. Pick an unresolved node $v$ in $T_1'$ such that $\mathcal{M}'(v) \ne \emptyset$, where $\mathcal{M}'(v)$ is the set of triplets (quartets) associated with $v$ that are resolved in $T_2$ but not in $T_1'$. In the rooted case, let $u_1, \ldots, u_d$ be the children of $v$; in the unrooted case, let $u_1, \ldots, u_d$ be the neighbors of $v$.

2. For rooted trees, find a $q \in [d]$ such that $|A_q| \ge 2|F_q|$ (such a $q$ exists by Lemma 6.3). For unrooted trees, find $q, r \in [d]$ such that $|A_{qr}| \ge 2|F_{qr}|$ (such $q$, $r$ exist by Lemma 6.3).

3. In the rooted case, set $T_1'$ = Pull-Out($T_1'$, $u_q$); in the unrooted case, set $T_1'$ = Pull-2-Out($T_1'$, $u_q$, $u_r$).

When this algorithm terminates, $\mathcal{M}'(v) = \emptyset$ for every $v \in V(T_1')$. Thus, $\mathcal{R}_2(T_1',T_2) = \emptyset$. Furthermore, the choice of $q$ (or $q$ and $r$) in step (2) guarantees that $|\mathcal{D}(T_1', T_2)| \ge |\mathcal{D}(T_1, T_2)| + \frac{2}{3} \cdot |\mathcal{R}_2(T_1,T_2)|$. □



7 Computing parametric triplet distance 

In this section we show that the parametric triplet distance (PTD), $d^{(p)}$, between two phylogenetic trees $T_1$ and $T_2$ over the same set of $n$ taxa can be computed in $O(n^2)$ time.

Before we outline our PTD algorithm, we need some notation. Let $T$ be a rooted phylogenetic tree. Then, $R(T)$ denotes the set of all triplets that are resolved in $T$ and $U(T)$ denotes the set of all triplets that are unresolved in $T$.

The next proposition is easily proved.

Proposition 7.1. For any two phylogenies $T_1$, $T_2$ over the same set of taxa,

(i) $|\mathcal{R}_1(T_1,T_2)| + |\mathcal{U}(T_1,T_2)| = |U(T_2)|$,

(ii) $|\mathcal{R}_2(T_1,T_2)| + |\mathcal{U}(T_1,T_2)| = |U(T_1)|$,

(iii) $|\mathcal{S}(T_1,T_2)| + |\mathcal{D}(T_1,T_2)| + |\mathcal{R}_1(T_1,T_2)| = |R(T_1)|$.

By Prop. 7.1 and Eqn. (1), the parametric distance between $T_1$ and $T_2$ can be expressed as

$$d^{(p)}(T_1,T_2) = |R(T_1)| - |\mathcal{S}(T_1,T_2)| + p \cdot (|U(T_1)| - |U(T_2)|) + (2p - 1) \cdot |\mathcal{R}_1(T_1, T_2)|. \tag{31}$$
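Equation (31) can be checked numerically against the direct form $|\mathcal{D}| + p\,(|\mathcal{R}_1| + |\mathcal{R}_2|)$ of the parametric distance. The sketch below is only a sanity check under assumed inputs: the function names are hypothetical, and the counts are made up but chosen to satisfy Proposition 7.1.

```python
from fractions import Fraction


def ptd_direct(num_d, num_r1, num_r2, p):
    """Parametric distance from its defining counts: |D| + p(|R1| + |R2|)."""
    return num_d + p * (num_r1 + num_r2)


def ptd_via_eq31(res_t1, num_s, unres_t1, unres_t2, num_r1, p):
    """Parametric distance via Equation (31).

    res_t1 = |R(T1)|, unres_t1 = |U(T1)|, unres_t2 = |U(T2)|,
    num_s = |S(T1,T2)| and num_r1 = |R1(T1,T2)|.
    """
    return res_t1 - num_s + p * (unres_t1 - unres_t2) + (2 * p - 1) * num_r1


# Hypothetical counts for two trees on five taxa (10 triplets in total).
num_s, num_d, num_r1, num_r2, num_u = 3, 2, 2, 1, 2
res_t1 = num_s + num_d + num_r1   # Proposition 7.1 (iii)
unres_t1 = num_r2 + num_u         # Proposition 7.1 (ii)
unres_t2 = num_r1 + num_u         # Proposition 7.1 (i)

p = Fraction(2, 3)
direct = ptd_direct(num_d, num_r1, num_r2, p)
via_eq31 = ptd_via_eq31(res_t1, num_s, unres_t1, unres_t2, num_r1, p)
```

The point of Equation (31) is that it avoids $|\mathcal{D}|$, $|\mathcal{R}_2|$ and $|\mathcal{U}|$, so only $|\mathcal{S}|$ and $|\mathcal{R}_1|$ require a pass over pairs of nodes.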

Our PTD algorithm proceeds as follows. After an initial $O(n^2)$ preprocessing step (Section 7.1), the algorithm computes $|R(T_1)|$, $|U(T_1)|$ and $|U(T_2)|$ using an $O(n)$-time procedure (Section 7.2). Next, it computes $|\mathcal{S}(T_1,T_2)|$ and $|\mathcal{R}_1(T_1, T_2)|$. As described in Sections 7.3 and 7.4, this takes $O(n^2)$ time. Then, it uses these values to compute $d^{(p)}(T_1, T_2)$, in $O(1)$ time, via Equation (31). To summarize, we have the following result.

Theorem 7.1. The parametric triplet distance $d^{(p)}(T_1,T_2)$ for two rooted phylogenetic trees $T_1$ and $T_2$ over the same set of $n$ taxa can be computed in $O(n^2)$ time.

In the rest of this section we use the following notation. We write $rt(T)$ to denote the root node of a tree $T$. Let $v$ be a node in $T$. Then, $pa(v)$ denotes the parent of $v$ in $T$ and $Ch(v)$ is the set of children of $v$. We write $\overline{T(v)}$ to denote the tree obtained by deleting $T(v)$ from $T$, as well as the edge from $v$ to its parent, if such an edge exists.

7.1 The preprocessing step

The purpose of the preprocessing step is to calculate and store the following four quantities for every pair $(u, v)$, where $u \in V(T_1)$ and $v \in V(T_2)$: $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$, $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(\overline{T_2(v)})|$, $|\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(T_2(v))|$, and $|\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})|$. These values are stored in a table so that any value can be accessed in $O(1)$ time by subsequent steps of the PTD algorithm.

Lemma 7.1. The values $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$, $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(\overline{T_2(v)})|$, $|\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(T_2(v))|$, and $|\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})|$ can be collectively computed for every pair of nodes $(u, v)$, where $u \in V(T_1)$ and $v \in V(T_2)$, in $O(n^2)$ time.

Proof. We first observe that the values $|\mathcal{L}(T_1(u))|$, for all $u \in V(T_1)$, can be computed in $O(n)$ total time by a simple post order traversal of $T_1$. The same holds for tree $T_2$.

Consider the value $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$. We consider three cases.

1. If $u$ and $v$ are both leaf nodes then computing $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$ is trivial.

2. If $u$ is a leaf node, but $v$ is not a leaf node, then

$$|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))| = \sum_{x \in Ch(v)} |\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(x))|.$$

3. If $u$ is not a leaf node, then

$$|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))| = \sum_{x \in Ch(u)} |\mathcal{L}(T_1(x)) \cap \mathcal{L}(T_2(v))|.$$

We compute the value $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$, for every pair $(u, v)$, using an interleaved post order traversal of $T_1$ and $T_2$. This traversal works as follows: For each node $u$ in a post order traversal of $T_1$, we consider each node $v$ in a post order traversal of $T_2$. This ensures that when the intersection size for a pair of nodes is computed, the intersection sizes for all pairs of their children have already been computed. The total time complexity for computing the required values in this way can be bounded as follows. For a pair of nodes $u$ and $v$ from $T_1$ and $T_2$ respectively, the value $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$ can be computed in $O(|Ch(u)| + |Ch(v)|)$ time and all the remaining three set intersection values in $O(1)$ time. Summing this over all possible pairs of nodes, we get a total time of $O\bigl(\sum_{u \in V(T_1)} \sum_{v \in V(T_2)} (|Ch(u)| + |Ch(v)|)\bigr)$, which is $O(n^2)$.

Once the value $|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|$ has been computed for every pair $(u, v)$, the remaining quantities we seek can be computed using the following relations.

$$|\mathcal{L}(T_1(u)) \cap \mathcal{L}(\overline{T_2(v)})| = |\mathcal{L}(T_1(u))| - |\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|,$$

$$|\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(T_2(v))| = |\mathcal{L}(T_2(v))| - |\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|, \text{ and}$$

$$|\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})| = n - \bigl(|\mathcal{L}(T_1(u))| + |\mathcal{L}(T_2(v))| - |\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|\bigr).$$

Thus, each of these values can be computed in $O(1)$ time, for a total of $O(n^2)$. □
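The three cases and the complement relations in this proof can be sketched as follows. This is an illustrative sketch rather than the authors' code: each tree is assumed to be given as a dictionary from internal node to its list of children, leaves carry the same labels in both trees, and the names `postorder`, `intersection_table` and `four_counts` are hypothetical.

```python
def postorder(children, root):
    """Return the nodes of the tree in post order (children before parents)."""
    order, stack = [], [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            for child in children.get(node, []):
                stack.append((child, False))
    return order


def intersection_table(ch1, root1, ch2, root2):
    """Map (u, v) to |L(T1(u)) ∩ L(T2(v))| for every node pair."""
    inter = {}
    order2 = postorder(ch2, root2)
    for u in postorder(ch1, root1):
        kids_u = ch1.get(u, [])
        for v in order2:
            kids_v = ch2.get(v, [])
            if not kids_u and not kids_v:      # case 1: two leaves
                inter[u, v] = 1 if u == v else 0
            elif not kids_u:                   # case 2: u leaf, v internal
                inter[u, v] = sum(inter[u, x] for x in kids_v)
            else:                              # case 3: u internal
                inter[u, v] = sum(inter[x, v] for x in kids_u)
    return inter


def four_counts(inter, root1, root2, u, v):
    """Return the four intersection sizes for the pair (u, v)."""
    n = inter[root1, root2]
    size_u, size_v, both = inter[u, root2], inter[root1, v], inter[u, v]
    return both, size_u - both, size_v - both, n - (size_u + size_v - both)


# T1 = ((a,b),c,d) and T2 = ((b,c,d),a)
ch1 = {"r1": ["x", "c", "d"], "x": ["a", "b"]}
ch2 = {"r2": ["y", "a"], "y": ["b", "c", "d"]}
table = intersection_table(ch1, "r1", ch2, "r2")
counts_xy = four_counts(table, "r1", "r2", "x", "y")
```

The sketch reads subtree sizes off the table itself, using the fact that $|\mathcal{L}(T_1(u))|$ equals the intersection of $\mathcal{L}(T_1(u))$ with the full leaf set below the root of $T_2$.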

We store these $O(n^2)$ values in an array indexed by $u$ and $v$, for each $u \in V(T_1)$ and $v \in V(T_2)$. This enables constant time insertion and look-up of any stored value, when the two relevant nodes are given.

7.2 Computing $|R(T_1)|$, $|U(T_1)|$, and $|U(T_2)|$

Here we prove the following result.

Lemma 7.2. Given a rooted phylogenetic tree $T$ over $n$ leaves, the values $|R(T)|$ and $|U(T)|$ can be computed in $O(n)$ time.

Thus, $|R(T_1)|$, $|U(T_1)|$ and $|U(T_2)|$ can all be computed in $O(n)$ time.

To prove Lemma 7.2, we need some terminology and an auxiliary result. Let $e = (v, pa(v))$ be any internal edge in $T$. Consider any two leaves $x, y$ from $\mathcal{L}(T(v))$, and any leaf $z$ from $\mathcal{L}(\overline{T(v)})$. Then, the triplet $\{x, y, z\}$ must appear resolved as $xy|z$ in $T$; we say that the triplet tree $xy|z$ is induced by the edge $(v, pa(v))$. Note that the same resolved triplet tree may be induced by multiple edges in $T$. We say that the triplet tree $xy|z$ is strictly induced by the edge $(v, pa(v))$ if $xy|z$ is induced by $(v, pa(v))$ and, additionally, $x \in \mathcal{L}(T(v_1))$ and $y \in \mathcal{L}(T(v_2))$ for some $v_1, v_2 \in Ch(v)$ such that $v_1 \ne v_2$.

Lemma 7.3. Given a tree $T$ and a triplet $X$, if $T|X$ is fully resolved then $T|X$ is strictly induced by exactly one edge in $T$.

Proof. Let $X = \{a, b, c\}$. Without loss of generality, assume that $T|X = ab|c$. If $v$ denotes the lca of $a$ and $b$ in $T$, the edge $(v, pa(v))$ must induce $ab|c$. Moreover, $v$ must be the only node in $T$ for which there exist nodes $v_1, v_2 \in Ch(v)$, $v_1 \ne v_2$, such that $a \in \mathcal{L}(T(v_1))$ and $b \in \mathcal{L}(T(v_2))$. Thus, there is exactly one edge in $T$ that strictly induces $T|X$. □

Proof of Lemma 7.2. Since $|R(T)| + |U(T)| = \binom{n}{3}$, given $|R(T)|$, the value $|U(T)|$ can be computed in $O(1)$ additional time. Thus, we only need to show that the value of $|R(T)|$ can be computed in $O(n)$ time.

The first step is to traverse the tree $T$ in post order to compute the values $\alpha_v = |\mathcal{L}(T(v))|$ and $\beta_v = n - \alpha_v$ at each node $v \in V(T)$. This takes $O(n)$ time.

For any $v \in V(T) \setminus \{rt(T)\}$, let $\phi(v)$ denote the number of triplets that are strictly induced by the edge $(v, pa(v))$ in tree $T$. Observe that any triplet that is strictly induced by an edge in $T$ must be fully resolved in $T$. Thus, Lemma 7.3 implies that the sum of $\phi(v)$ over all internal nodes $v \in V(T) \setminus \{rt(T)\}$ yields the value $|R(T)|$. We now show how to compute the value of $\phi(v)$.

Let $X = \{a, b, c\}$ be a triplet that is counted in $\phi(v)$. And, without loss of generality, let $T|X = ab|c$. It can be verified that $X$ must satisfy the following two conditions: (i) $a, b \in \mathcal{L}(T(v))$ and $c \in \mathcal{L}(\overline{T(v)})$, and (ii) there does not exist any $x \in Ch(v)$ such that $a, b \in \mathcal{L}(T(x))$. The number of triplets that satisfy condition (i) is $\binom{\alpha_v}{2} \cdot \beta_v$, and the number of triplets that satisfy condition (i), but not condition (ii), is exactly $\sum_{x \in Ch(v)} \binom{\alpha_x}{2} \cdot \beta_v$. Thus, $\phi(v) = \binom{\alpha_v}{2} \cdot \beta_v - \sum_{x \in Ch(v)} \binom{\alpha_x}{2} \cdot \beta_v$.

Computing $\phi(v)$ requires $O(|Ch(v)|)$ time; hence, the time complexity for computing $|R(T)|$ is $O\bigl(\sum_{v \in V(T)} |Ch(v)|\bigr)$, which is $O(n)$. □
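The counting argument in this proof can be sketched in a few lines. The representation (a dictionary from internal node to its list of children) and the name `count_resolved_triplets` are assumptions of this example; the body follows the formula for $\phi(v)$ above.

```python
from math import comb


def count_resolved_triplets(children, root):
    """Return (|R(T)|, |U(T)|) for a rooted tree given as child lists."""
    # Pre-order listing; its reverse visits children before parents.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # alpha[v] = number of leaves in the subtree rooted at v.
    alpha = {}
    for node in reversed(order):
        kids = children.get(node, [])
        alpha[node] = 1 if not kids else sum(alpha[k] for k in kids)

    n = alpha[root]
    resolved = 0
    for v in order:
        kids = children.get(v, [])
        if v == root or not kids:
            continue  # only internal, non-root nodes induce triplets
        beta = n - alpha[v]
        # phi(v): pairs of leaves below v that lie under different children
        # of v, each combined with one leaf outside the subtree of v.
        pairs = comb(alpha[v], 2) - sum(comb(alpha[x], 2) for x in kids)
        resolved += pairs * beta
    return resolved, comb(n, 3) - resolved


# ((a,b),c,d): only the triplets abc and abd are resolved.
partly = count_resolved_triplets({"r": ["x", "c", "d"], "x": ["a", "b"]}, "r")
# (((a,b),c),d): a fully resolved tree on four leaves.
full = count_resolved_triplets({"r": ["y", "d"], "y": ["x", "c"], "x": ["a", "b"]}, "r")
```

Each node is touched a constant number of times apart from the sum over its children, which mirrors the $O(n)$ bound in the proof.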

7.3 Computing $|\mathcal{S}(T_1,T_2)|$

We now describe an $O(n^2)$ time algorithm to compute the size of the set $\mathcal{S}(T_1,T_2)$ of shared triplets; that is, triplets that are fully and identically resolved in $T_1$ and $T_2$.

For any $u \in V(T_1) \setminus (\{rt(T_1)\} \cup \mathcal{L}(T_1))$ and $v \in V(T_2) \setminus (\{rt(T_2)\} \cup \mathcal{L}(T_2))$, let $s(u, v)$ denote the number of identical triplet trees strictly induced by edge $(u, pa(u))$ in $T_1$ and edge $(v, pa(v))$ in $T_2$. We have the following result.

Lemma 7.4. Given $T_1$ and $T_2$, we have,

$$|\mathcal{S}(T_1,T_2)| = \sum_{\substack{u \in V(T_1) \setminus (\{rt(T_1)\} \cup \mathcal{L}(T_1)), \\ v \in V(T_2) \setminus (\{rt(T_2)\} \cup \mathcal{L}(T_2))}} s(u,v). \tag{32}$$

Proof. Consider any triplet $X \in \mathcal{S}(T_1, T_2)$. Since $T_1|X$ is fully resolved and $T_1|X = T_2|X$ then, by Lemma 7.3, there exists exactly one node $u \in V(T_1) \setminus \{rt(T_1)\}$ and one node $v \in V(T_2) \setminus \{rt(T_2)\}$ such that the edge $(u, pa(u))$ strictly induces $T_1|X$ in $T_1$, and edge $(v, pa(v))$ strictly induces $T_2|X$ in $T_2$. Additionally, neither $u$ nor $v$ can be leaf nodes in $T_1$ and $T_2$ respectively. Thus, $X$ would be counted exactly once in the right-hand side of Equation (32): in the value $s(u, v)$. Moreover, by the definition of $s(u, v)$, any triplet tree that is counted on the right-hand side of Equation (32) must belong to the set $\mathcal{S}(T_1, T_2)$. The lemma follows. □

The following lemma shows how to compute the value of $s(u, v)$ using the values computed in the preprocessing step.

Lemma 7.5. Given any $u \in V(T_1) \setminus (\{rt(T_1)\} \cup \mathcal{L}(T_1))$ and $v \in V(T_2) \setminus (\{rt(T_2)\} \cup \mathcal{L}(T_2))$, $s(u, v)$ can be computed in $O(|Ch(u)| \cdot |Ch(v)|)$ time.

Proof. We will show that $s(u, v) = n_1(u, v) - n_2(u, v) - n_3(u, v) + n_4(u, v)$, where

$$n_1(u,v) = \binom{|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))|}{2} \cdot |\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})|,$$

$$n_2(u,v) = \sum_{x \in Ch(u)} \binom{|\mathcal{L}(T_1(x)) \cap \mathcal{L}(T_2(v))|}{2} \cdot |\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})|,$$

$$n_3(u,v) = \sum_{x \in Ch(v)} \binom{|\mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(x))|}{2} \cdot |\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})|, \text{ and}$$

$$n_4(u,v) = \sum_{x \in Ch(u)} \sum_{y \in Ch(v)} \binom{|\mathcal{L}(T_1(x)) \cap \mathcal{L}(T_2(y))|}{2} \cdot |\mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})|.$$

Consider any triplet tree, $ab|c$, counted in $s(u, v)$. It can be verified that $ab|c$ must satisfy the following three conditions: (i) $a, b \in \mathcal{L}(T_1(u)) \cap \mathcal{L}(T_2(v))$ and $c \in \mathcal{L}(\overline{T_1(u)}) \cap \mathcal{L}(\overline{T_2(v)})$, (ii) there does not exist any $x \in Ch(u)$ such that $a, b \in \mathcal{L}(T_1(x))$, and (iii) there does not exist any $x \in Ch(v)$ such that $a, b \in \mathcal{L}(T_2(x))$. Moreover, observe that any triplet tree $ab|c$ that satisfies these three conditions is counted in $s(u, v)$. Therefore, $s(u, v)$ is exactly the number of triplet trees that satisfy all three conditions (i), (ii) and (iii).

The number of triplet trees that satisfy condition (i) is given by $n_1(u, v)$. Some of the triplet trees that satisfy condition (i) may not satisfy conditions (ii) or (iii); these must not be counted in $s(u, v)$. The value $n_2(u, v)$ is exactly the number of triplet trees that satisfy condition (i) but not condition (ii). Similarly, $n_3(u, v)$ is exactly the number of triplet trees that satisfy condition (i) but not (iii). Thus, the second and third terms must be subtracted from the first term. However, there may be triplet trees that satisfy condition (i) but neither (ii) nor (iii), and, consequently, get subtracted in both the second and third terms. In order to adjust for these, the value $n_4(u, v)$ counts exactly those triplet trees that satisfy condition (i) but neither (ii) nor (iii). □

procedure S(T1, T2)
    for each internal node u ∈ V(T1) \ {rt(T1)} do
        for each internal node v ∈ V(T2) \ {rt(T2)} do
            Compute s(u, v).
    return the sum of all computed s(·, ·).

Figure 1: Computing $|\mathcal{S}(T_1,T_2)|$

A summary of our algorithm to compute $|\mathcal{S}(T_1, T_2)|$ appears in Figure 1.

Lemma 7.6. Given two rooted phylogenetic trees $T_1$ and $T_2$ on the same $n$ leaves, the value $|\mathcal{S}(T_1, T_2)|$ can be computed in $O(n^2)$ time.

Proof. By Lemma 7.4, the algorithm of Figure 1 computes the value $|\mathcal{S}(T_1,T_2)|$ correctly. We now analyze its complexity. The running time of the algorithm is dominated by the complexity of computing the value $s(u, v)$ for each pair of internal nodes $u \in V(T_1)$ and $v \in V(T_2)$. According to Lemma 7.5, the value $s(u, v)$ can be computed in $O(|Ch(u)| \cdot |Ch(v)|)$ time. Thus, the total time complexity of the algorithm is $O\bigl(\sum_{u \in V(T_1)} \sum_{v \in V(T_2)} |Ch(u)| \cdot |Ch(v)|\bigr)$, which is $O(n^2)$. □
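The inclusion-exclusion count of Lemma 7.5 can be cross-checked against a direct enumeration of triplets on small inputs. The sketch below is for illustration only and makes two simplifying assumptions: trees are dictionaries from internal node to child list with shared leaf labels, and intersection sizes are computed from explicit leaf sets rather than looked up in the $O(1)$ table of Section 7.1, so it does not achieve the $O(n^2)$ bound. All function names are hypothetical.

```python
from itertools import combinations
from math import comb


def leaf_sets(children, root):
    """Map every node to the frozenset of leaf labels in its subtree."""
    sets = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            sets[node] = frozenset([node])
        else:
            sets[node] = frozenset().union(*[visit(k) for k in kids])
        return sets[node]

    visit(root)
    return sets


def shared_triplets(ch1, root1, ch2, root2):
    """|S(T1,T2)| as the sum of s(u,v) = n1 - n2 - n3 + n4 (Lemmas 7.4, 7.5)."""
    ls1, ls2 = leaf_sets(ch1, root1), leaf_sets(ch2, root2)
    n = len(ls1[root1])
    total = 0
    for u, kids_u in ch1.items():
        if u == root1 or not kids_u:
            continue
        for v, kids_v in ch2.items():
            if v == root2 or not kids_v:
                continue
            both = len(ls1[u] & ls2[v])
            outside = n - (len(ls1[u]) + len(ls2[v]) - both)
            n1 = comb(both, 2)
            n2 = sum(comb(len(ls1[x] & ls2[v]), 2) for x in kids_u)
            n3 = sum(comb(len(ls1[u] & ls2[y]), 2) for y in kids_v)
            n4 = sum(comb(len(ls1[x] & ls2[y]), 2) for x in kids_u for y in kids_v)
            total += (n1 - n2 - n3 + n4) * outside
    return total


def shared_triplets_brute(ch1, root1, ch2, root2):
    """Reference count: enumerate all triplets and compare their topologies."""
    ls1, ls2 = leaf_sets(ch1, root1), leaf_sets(ch2, root2)

    def topology(sets, triple):
        # A triplet is resolved iff some subtree holds exactly two of its leaves.
        for leaves in sets.values():
            hit = leaves & triple
            if len(hit) == 2:
                return hit
        return None

    count = 0
    for triple in combinations(sorted(ls1[root1]), 3):
        t = frozenset(triple)
        top1, top2 = topology(ls1, t), topology(ls2, t)
        if top1 is not None and top1 == top2:
            count += 1
    return count


# T1 = (((a,b),c),d,e) and T2 = ((a,b),(c,d),e)
t1 = {"r1": ["y", "d", "e"], "y": ["x", "c"], "x": ["a", "b"]}
t2 = {"r2": ["p", "q", "e"], "p": ["a", "b"], "q": ["c", "d"]}
fast = shared_triplets(t1, "r1", t2, "r2")
slow = shared_triplets_brute(t1, "r1", t2, "r2")
```

In this example the shared triplets are ab|c, ab|d and ab|e, all contributed by the single pair (x, p).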

7.4 Computing |7^l (Ti , T2) \ 

Next, we describe an $O(n^2)$-time algorithm that computes the cardinality of the set $\mathcal{R}_1(T_1, T_2)$ of triplets that are resolved only in tree $T_1$. First, we need a definition. Let $X$ be a triplet that is unresolved in $T_2$. Let $v$ be the least common ancestor (lca) of $X$ in $T_2$. We say that $X$ is associated with $v$. Observe that node $v$ must be internal and unresolved. Note also that $X$ is associated with exactly one node in $T_2$.
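
To make the definition concrete, the following brute-force sketch counts the triplets that are resolved in one rooted tree and unresolved in the other by inspecting every triplet directly. It is only a reference for checking small examples, not the $O(n^2)$ algorithm developed in this section. The nested-tuple tree encoding and the names `leaves`, `triplet_topology`, and `count_r1` are assumptions made for this illustration.

```python
from itertools import combinations


def leaves(t):
    """Leaf labels below a node; a leaf is any non-tuple value."""
    if not isinstance(t, tuple):
        return {t}
    return set().union(*(leaves(c) for c in t))


def triplet_topology(t, X):
    """Return the pair {a, b} if t|X = ab|c, or None if X is unresolved in t."""
    X = set(X)
    node = t
    while True:
        # Split X among the children of the current node.
        hit = [(c, X & leaves(c)) for c in node]
        hit = [(c, part) for c, part in hit if part]
        if len(hit) == 1:
            node = hit[0][0]          # the lca lies deeper: descend
        elif len(hit) == 3:
            return None               # three distinct children: unresolved at the lca
        else:
            # Two children hit: the child holding two leaves defines the cherry.
            return frozenset(max((part for _, part in hit), key=len))


def count_r1(t1, t2):
    """Number of triplets resolved in t1 and unresolved in t2."""
    return sum(
        1
        for X in combinations(sorted(leaves(t1)), 3)
        if triplet_topology(t1, X) is not None
        and triplet_topology(t2, X) is None
    )


t1 = ((("a", "b"), "c"), "d")   # fully resolved
t2 = (("a", "b", "c"), "d")     # {a, b, c} is unresolved at its lca
print(count_r1(t1, t2))
```

In this example only the triplet $\{a, b, c\}$ is resolved in the first tree and unresolved in the second, and it is associated with the node of the second tree whose children are $a$, $b$, and $c$.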

For any $u \in V(T_1) \setminus (\mathrm{rt}(T_1) \cup \mathcal{L}(T_1))$ and $v \in V(T_2) \setminus \mathcal{L}(T_2)$, let $r_1(u, v)$ denote the number of triplets $X$ such that $T_1|X$ is strictly induced by edge $\{u, \mathrm{pa}(u)\}$ in $T_1$, and $X$ is associated with the node $v$ in $T_2$.

The triplets counted in $r_1(u, v)$ must be resolved in $T_1$ but unresolved in $T_2$. Our algorithm computes the value $|\mathcal{R}_1(T_1, T_2)|$ by computing, for each $u \in V(T_1) \setminus (\mathrm{rt}(T_1) \cup \mathcal{L}(T_1))$ and $v \in V(T_2) \setminus \mathcal{L}(T_2)$, the value $r_1(u, v)$. We claim that the sum of all the computed $r_1(u, v)$'s yields the value $|\mathcal{R}_1(T_1, T_2)|$.

Lemma 7.7. Given $T_1$ and $T_2$, we have

$$|\mathcal{R}_1(T_1, T_2)| = \sum_{\substack{u \in V(T_1) \setminus (\mathrm{rt}(T_1) \cup \mathcal{L}(T_1)), \\ v \in V(T_2) \setminus \mathcal{L}(T_2)}} r_1(u, v). \tag{33}$$

Proof. Consider any triplet $X \in \mathcal{R}_1(T_1, T_2)$. By Lemma 7.3, there exists exactly one node $u \in V(T_1) \setminus \mathrm{rt}(T_1)$ such that the edge $\{u, \mathrm{pa}(u)\}$ strictly induces $T_1|X$ in $T_1$. Also observe that there must be exactly one unresolved node $v \in V(T_2)$ with which $X$ is associated. Additionally, neither $u$ nor $v$ can be leaf nodes in $T_1$ and $T_2$, respectively. Thus, $X$ is counted exactly once in the right-hand side of Equation (33): in the value $r_1(u, v)$. Moreover, by the definition of $r_1(u, v)$, any triplet that is counted in the right-hand side of Equation (33) must belong to the set $\mathcal{R}_1(T_1, T_2)$. The lemma follows. □

Given a path $u_1, u_2, \ldots, u_k$, where $k \ge 2$, in tree $T_1$ such that $u_k$ is an internal node and $u_1$ is an ancestor of $u_k$, let $\gamma(u_1, u_k, v)$ denote the number of triplets $X$ such that $T_1|X$ is induced by every edge $\{u_{i-1}, u_i\}$, for $2 \le i \le k$, in $T_1$ and $X$ is associated with node $v$ in $T_2$.

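The following lemma shows how the value of $r_1(u, v)$ can be computed by first computing certain $\gamma(\cdot, \cdot, \cdot)$ values.

Lemma 7.8. For any $u \in V(T_1) \setminus (\mathrm{rt}(T_1) \cup \mathcal{L}(T_1))$ and $v \in V(T_2) \setminus \mathcal{L}(T_2)$,

$$r_1(u, v) = \gamma(\mathrm{pa}(u), u, v) - \sum_{x \in \mathrm{Ch}(u)} \gamma(\mathrm{pa}(u), x, v).$$

Proof. Let $X = \{a, b, c\}$ be a triplet that is counted in $r_1(u, v)$. Without loss of generality, let $T_1|X = ab|c$. It can be verified that $X$ must satisfy the following three conditions: (i) $X$ must be associated with $v$ in $T_2$, (ii) $a, b \in \mathcal{L}(T_1(u))$ and $c \notin \mathcal{L}(T_1(u))$, and (iii) there must not exist any $x \in \mathrm{Ch}(u)$ such that $a, b \in \mathcal{L}(T_1(x))$. Moreover, observe that if a triplet $X = \{a, b, c\}$ satisfies these three conditions, then $X$ is counted in $r_1(u, v)$; these three conditions are thus necessary and sufficient.

Now observe that $\gamma(\mathrm{pa}(u), u, v)$ counts exactly those triplets that satisfy conditions (i) and (ii), while $\sum_{x \in \mathrm{Ch}(u)} \gamma(\mathrm{pa}(u), x, v)$ counts exactly those triplets that satisfy conditions (i) and (ii), but not condition (iii). The lemma follows immediately. □

To compute the value of $\gamma(\cdot, \cdot, \cdot)$ efficiently we use the following lemma.

Lemma 7.9. Consider a path $u_1, u_2, \ldots, u_k$, where $k \ge 2$, in tree $T_1$ such that $u_k$ is an internal node and $u_1$ is an ancestor of $u_k$. Let $v \in V(T_2)$ be an internal unresolved node. Then,

$$\gamma(u_1, u_k, v) = n_1(u_1, u_k, v) - n_2(u_1, u_k, v) - n_3(u_1, u_k, v) - n_4(u_1, u_k, v),$$

where

$$n_1(u_1, u_k, v) = \binom{|\mathcal{L}(T_2(v)) \cap \mathcal{L}(T_1(u_k))|}{2} \cdot |\mathcal{L}(T_2(v)) \setminus \mathcal{L}(T_1(u_2))|,$$

$$n_2(u_1, u_k, v) = \sum_{x \in \mathrm{Ch}(v)} \binom{|\mathcal{L}(T_2(x)) \cap \mathcal{L}(T_1(u_k))|}{2} \cdot |\mathcal{L}(T_2(x)) \setminus \mathcal{L}(T_1(u_2))|,$$

$$n_3(u_1, u_k, v) = \sum_{x \in \mathrm{Ch}(v)} \binom{|\mathcal{L}(T_2(x)) \cap \mathcal{L}(T_1(u_k))|}{2} \cdot \bigl(|\mathcal{L}(T_2(v)) \setminus \mathcal{L}(T_1(u_2))| - |\mathcal{L}(T_2(x)) \setminus \mathcal{L}(T_1(u_2))|\bigr),$$

and

$$n_4(u_1, u_k, v) = \sum_{x \in \mathrm{Ch}(v)} |\mathcal{L}(T_2(x)) \cap \mathcal{L}(T_1(u_k))| \cdot |\mathcal{L}(T_2(x)) \setminus \mathcal{L}(T_1(u_2))| \cdot \bigl(|\mathcal{L}(T_2(v)) \cap \mathcal{L}(T_1(u_k))| - |\mathcal{L}(T_2(x)) \cap \mathcal{L}(T_1(u_k))|\bigr).$$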

Proof. Consider those triplets $X$ for which $T_1|X$ is induced by every edge $\{u_{i-1}, u_i\}$, for $2 \le i \le k$, in $T_1$, and $T_2|X$ is a subtree of $T_2(v)$. Let us call these triplets relevant. Any relevant triplet must have all three leaves from $\mathcal{L}(T_2(v))$, two leaves from $\mathcal{L}(T_1(u_k))$, and the third leaf from outside $\mathcal{L}(T_1(u_2))$. Also note that any triplet that satisfies these three conditions must be relevant. The number of triplets that satisfy these conditions is exactly $n_1(u_1, u_k, v)$.

Any relevant triplet $X$ must belong to one of the following four categories:

1. The lca of $X$ in $T_2$ is not node $v$: This implies that, in addition to being a relevant triplet, all three leaves of $X$ must belong to the same subtree of $T_2$ rooted at a child of $v$. The number of such triplets is $n_2(u_1, u_k, v)$.

2. The lca of $X$ in $T_2$ is node $v$, $X$ is resolved in $T_2$, and $T_1|X = T_2|X$: A relevant triplet $X$ satisfies this criterion if and only if there exists a child $x \in \mathrm{Ch}(v)$ such that the two leaves of this triplet that belong to $\mathcal{L}(T_1(u_k))$ in tree $T_1$ also occur in $\mathcal{L}(T_2(x))$, and the third leaf (which lies outside $\mathcal{L}(T_1(u_2))$ in $T_1$) occurs in $\mathcal{L}(T_2(y))$ where $y \in \mathrm{Ch}(v) \setminus \{x\}$. The number of such $X$ is equal to $n_3(u_1, u_k, v)$.

3. The lca of $X$ in $T_2$ is node $v$, $X$ is resolved in $T_2$, but $T_1|X \ne T_2|X$: A relevant triplet $X$ satisfies this criterion if and only if there exists a child $x \in \mathrm{Ch}(v)$ such that a pair of leaves of $X$, one lying in $\mathcal{L}(T_1(u_k))$ and the other outside $\mathcal{L}(T_1(u_2))$ in tree $T_1$, occur in $\mathcal{L}(T_2(x))$ in tree $T_2$, and the third leaf (which occurs in $\mathcal{L}(T_1(u_k))$ in $T_1$) occurs in $\mathcal{L}(T_2(y))$ where $y \in \mathrm{Ch}(v) \setminus \{x\}$. The number of such $X$ is given by $n_4(u_1, u_k, v)$.

4. The lca of $X$ in $T_2$ is node $v$, and $X$ is unresolved in $T_2$: By definition, the number of relevant triplets that satisfy this criterion is exactly $\gamma(u_1, u_k, v)$.

We have shown that $n_2(u_1, u_k, v)$, $n_3(u_1, u_k, v)$, and $n_4(u_1, u_k, v)$ are exactly the numbers of relevant triplets belonging to categories 1, 2, and 3, respectively. The lemma follows. □

We should remark that the procedure to compute the value of $\gamma(u_1, u_k, v)$ given in the preceding proof may seem circuitous. However, we have been unable to find a direct method with an equally good time complexity.
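
The counting identity of Lemma 7.9 can be checked numerically on small inputs. The sketch below works only with counts: for each child $x_i$ of $v$, `a[i]` is the number of leaves of $T_2(x_i)$ that lie in $\mathcal{L}(T_1(u_k))$ and `c[i]` is the number that lie outside $\mathcal{L}(T_1(u_2))$. This count-based encoding and the function names are assumptions made for the illustration; the brute-force routine simply enumerates relevant triplets whose three leaves fall under three distinct children of $v$.

```python
from itertools import combinations
from math import comb


def gamma_triplet(a, c):
    """gamma via n1 - n2 - n3 - n4, using per-child leaf counts a[i], c[i]."""
    A, C = sum(a), sum(c)   # totals over all children of v
    n1 = comb(A, 2) * C                                        # all relevant triplets
    n2 = sum(comb(ai, 2) * ci for ai, ci in zip(a, c))         # all three under one child
    n3 = sum(comb(ai, 2) * (C - ci) for ai, ci in zip(a, c))   # resolved, same topology
    n4 = sum(ai * ci * (A - ai) for ai, ci in zip(a, c))       # resolved, other topology
    return n1 - n2 - n3 - n4


def gamma_triplet_brute(a, c):
    """Enumerate triplets with two 'inside' leaves and one 'outside' leaf
    whose three leaves lie under three distinct children of v."""
    inside = [i for i, ai in enumerate(a) for _ in range(ai)]
    outside = [i for i, ci in enumerate(c) for _ in range(ci)]
    count = 0
    for s, t in combinations(range(len(inside)), 2):
        for z in outside:
            if len({inside[s], inside[t], z}) == 3:
                count += 1
    return count


a, c = [2, 1, 0], [1, 1, 2]
print(gamma_triplet(a, c), gamma_triplet_brute(a, c))
```

Each of the four sums ranges over the children of $v$ and uses only precomputed intersection sizes, which is where the $O(|\mathrm{Ch}(v)|)$ bound used in the proof of Lemma 7.10 comes from.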

Lemma 7.10. Given two phylogenetic trees $T_1$ and $T_2$ on the same $n$ leaves, the value $|\mathcal{R}_1(T_1, T_2)|$ can be computed in $O(n^2)$ time.

Proof. Our algorithm for computing $|\mathcal{R}_1(T_1, T_2)|$ appears in Figure 2. The correctness of the algorithm follows from Lemma 7.7. We now analyze its complexity. For any given candidate nodes $u, v$, Lemma 7.9 shows how to compute $\gamma(\cdot, \cdot, v)$ in $O(|\mathrm{Ch}(v)|)$ time, and consequently, by Lemma 7.8, the value $r_1(u, v)$ can be computed in $O(|\mathrm{Ch}(u)| \cdot |\mathrm{Ch}(v)|)$ time. Thus, the total time complexity of the algorithm is $O\bigl(\sum_{u \in V(T_1)} \sum_{v \in V(T_2)} |\mathrm{Ch}(u)| \cdot |\mathrm{Ch}(v)|\bigr)$, which is $O(n^2)$. □

procedure $\mathcal{R}_1(T_1, T_2)$
  for each internal node $u \in V(T_1) \setminus \{\mathrm{rt}(T_1)\}$ do
    for each internal unresolved node $v \in V(T_2)$ do
      Compute $r_1(u, v)$.
  return the sum of all computed $r_1(\cdot, \cdot)$.

Figure 2: Computing $|\mathcal{R}_1(T_1, T_2)|$

8 An approximation algorithm for parametric quartet distance 

We now consider the problem of computing the parametric quartet distance (PQD) between two unrooted trees. Our main result is an $O(n^2)$-time 2-approximate algorithm for PQD.

Our approach is similar to the one for computing the parametric triplet distance. Observe that Proposition 7.1 and, thus, Equation (31) hold even when the unit of distance is quartets instead of triplets. Christiansen et al. show how to compute the values $|\mathcal{S}(T_1, T_2)|$, $|\mathcal{R}(T_1)|$, $|\mathcal{U}(T_1)|$, and $|\mathcal{U}(T_2)|$ within $O(n^2)$ time. In Section 8.1 we show how to compute, in $O(n^2)$ time, a value $y$ such that $|\mathcal{R}_1(T_1, T_2)| \le y \le 2 \cdot |\mathcal{R}_1(T_1, T_2)|$. Now, let us substitute the values of $|\mathcal{R}(T_1)|$, $|\mathcal{U}(T_1)|$, $|\mathcal{U}(T_2)|$, and $|\mathcal{S}(T_1, T_2)|$ into Equation (31), and use the value of $y$ instead of $|\mathcal{R}_1(T_1, T_2)|$. Assuming $p \ge 1/2$, it can be seen that the result is a 2-approximation to $d^{(p)}(T_1, T_2)$.
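
The 2-approximation claim can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in this section: that Equation (31) can be written as $d^{(p)}(T_1, T_2) = A + (2p - 1) \cdot |\mathcal{R}_1(T_1, T_2)|$, where $A$ collects the terms that do not involve $\mathcal{R}_1$, and that $A \ge 0$. The second assumption holds, for instance, if $d^{(p)} = |\mathcal{D}| + p\,(|\mathcal{R}_1| + |\mathcal{R}_2|)$ with $0 \le p \le 1$, where $\mathcal{D}$ and $\mathcal{R}_2$ (hypothetical notation here) denote the quartets resolved differently in the two trees and those resolved only in $T_2$; in that case $A = |\mathcal{D}| + (1 - p)\,|\mathcal{R}_1| + p\,|\mathcal{R}_2|$. Writing $x = A + (2p - 1)\,y$ with $|\mathcal{R}_1| \le y \le 2\,|\mathcal{R}_1|$ and $2p - 1 \ge 0$:

$$d^{(p)}(T_1, T_2) = A + (2p - 1)\,|\mathcal{R}_1| \;\le\; x \;\le\; A + 2\,(2p - 1)\,|\mathcal{R}_1| \;\le\; 2A + 2\,(2p - 1)\,|\mathcal{R}_1| = 2 \cdot d^{(p)}(T_1, T_2).$$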

To summarize, we have the following result. 

Theorem 8.1. Given two unrooted phylogenetic trees $T_1$ and $T_2$ on the same $n$ leaves, and a parameter $p \ge 1/2$, a value $x$ such that $d^{(p)}(T_1, T_2) \le x \le 2 \cdot d^{(p)}(T_1, T_2)$ can be computed in $O(n^2)$ time.

We note that the $(2p - 1) \cdot |\mathcal{R}_1(T_1, T_2)|$ term in Equation (31) vanishes when $p = 1/2$. In this case, we do not even need to compute $|\mathcal{R}_1(T_1, T_2)|$ to get the exact value of $d^{(p)}(T_1, T_2)$.

8.1 Computing a 2-approximate value of $|\mathcal{R}_1(T_1, T_2)|$

For any node $u$ in $T$, let $\mathrm{adj}(u)$ denote the set of nodes that are adjacent to $u$. For the purposes of describing our algorithm, it is useful to view each (undirected) edge $\{u, v\} \in \mathcal{E}(T)$ as two directed edges $(u, v)$ and $(v, u)$. Let $\vec{\mathcal{E}}(T)$ denote the set of directed edges in tree $T$.

To achieve the claimed time complexity, our algorithm relies on a preprocessing step which computes and stores, for each pair of directed edges $(u_1, v_1) \in \vec{\mathcal{E}}(T_1)$ and $(u_2, v_2) \in \vec{\mathcal{E}}(T_2)$, the quantity $|\mathcal{L}(T_1(u_1, v_1)) \cap \mathcal{L}(T_2(u_2, v_2))|$. This can be accomplished in $O(n^2)$ time by arbitrarily rooting $T_1$ and $T_2$ at any internal node and proceeding as in the preprocessing step for the triplet distance case (see Section 7).
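
One plausible way to build such a table for the rooted versions of the two trees is a post-order dynamic program, sketched below. This is an illustration under assumptions, not a reproduction of the paper's preprocessing step: trees are given as hypothetical child-list dictionaries, a node without an entry is a leaf, and leaves carry the same labels in both trees. The table covers rooted subtrees only; the quantity for the oppositely directed edge follows by subtracting from the corresponding totals.

```python
def postorder(children, root):
    """Nodes of a rooted tree, children before parents."""
    order = []

    def visit(u):
        for c in children.get(u, ()):
            visit(c)
        order.append(u)

    visit(root)
    return order


def intersection_table(children1, root1, children2, root2):
    """inter[u, v] = number of leaf labels shared by subtree u of T1
    and subtree v of T2, for every pair of nodes (u, v)."""
    order2 = postorder(children2, root2)
    inter = {}
    for u in postorder(children1, root1):
        for v in order2:
            cu = children1.get(u, ())
            cv = children2.get(v, ())
            if cu:      # u internal: add up the entries of its children
                inter[u, v] = sum(inter[c, v] for c in cu)
            elif cv:    # u is a leaf, v internal: add up over children of v
                inter[u, v] = sum(inter[u, c] for c in cv)
            else:       # two leaves: they intersect iff they are the same label
                inter[u, v] = int(u == v)
    return inter


children1 = {"r1": ["x", "c"], "x": ["a", "b"]}
children2 = {"r2": ["a", "y"], "y": ["b", "c"]}
table = intersection_table(children1, "r1", children2, "r2")
print(table["x", "y"], table["r1", "r2"])
```

Each table entry is obtained from the entries of the children of one of the two nodes, so the total work is proportional to the number of (node, child) pairs across the two trees, which is quadratic in $n$. The recursive traversal is used for brevity and would need to be made iterative for very deep trees.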

Consider any two leaves $a, b$ from $\mathcal{L}(T(u, v))$ and any two leaves $c, d$ from $\mathcal{L}(T(v, u))$. Then, the quartet $\{a, b, c, d\}$ must appear resolved as $ab|cd$ in $T$; we say that the quartet tree $ab|cd$ is induced by the edge $(u, v)$. Note that the same resolved quartet tree may be induced by multiple edges in $T$. Additionally, if $a \in \mathcal{L}(T(u_1, u))$ and $b \in \mathcal{L}(T(u_2, u))$ for some $u_1, u_2 \in \mathrm{adj}(u) \setminus \{v\}$ such that $u_1 \ne u_2$, then we say that the quartet tree $ab|cd$ is strictly induced by the directed edge $(u, v)$.

Consider a quartet $\{a, b, c, d\}$. Then, the corresponding quartet tree is unresolved in $T$ if and only if there exists exactly one node $w$ such that the paths from $w$ to $a$, $w$ to $b$, $w$ to $c$, and $w$ to $d$ do not share any edges. We say that quartet $\{a, b, c, d\}$ is associated with node $w$ in $T$. Thus, each unresolved quartet tree from $T$ is associated with exactly one node in $T$.
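
For small examples, the resolved or unresolved status of a quartet can be checked directly. The sketch below does not use the edge-based definitions above; it swaps in the standard four-point condition on path lengths with unit edge weights, under which the split $ab|cd$ is the unique pairing with the smallest sum $d(a,b) + d(c,d)$, and a quartet is unresolved exactly when all three pairings give the same sum. The adjacency-dictionary encoding and the function names are assumptions made for the illustration.

```python
from collections import deque


def distances_from(adj, src):
    """Breadth-first path lengths (in edges) from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def quartet_topology(adj, a, b, c, d):
    """Return the split as a frozenset of two frozensets, or None if unresolved."""
    da, db, dc = (distances_from(adj, s) for s in (a, b, c))
    sums = {
        frozenset({frozenset({a, b}), frozenset({c, d})}): da[b] + dc[d],
        frozenset({frozenset({a, c}), frozenset({b, d})}): da[c] + db[d],
        frozenset({frozenset({a, d}), frozenset({b, c})}): da[d] + db[c],
    }
    best = min(sums.values())
    winners = [split for split, total in sums.items() if total == best]
    # A unique minimum identifies the split; a three-way tie means a star.
    return winners[0] if len(winners) == 1 else None


# Leaves a, b hang off u; leaves c, d, e hang off v; u and v are adjacent.
adj = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"], "e": ["v"],
    "u": ["a", "b", "v"], "v": ["u", "c", "d", "e"],
}
print(quartet_topology(adj, "a", "b", "c", "d"))
print(quartet_topology(adj, "a", "c", "d", "e"))
```

In this example the quartet $\{a, b, c, d\}$ is resolved as $ab|cd$, while $\{a, c, d, e\}$ is unresolved and is associated with node $v$, which has degree four.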

For any directed edge $(u, v) \in \vec{\mathcal{E}}(T_1)$ and $w \in V(T_2) \setminus \mathcal{L}(T_2)$, let $r_1((u, v), w)$ denote the number of quartets $X$ such that $T_1|X$ is strictly induced by the directed edge $(u, v)$ in $T_1$, and $X$ is associated with the node $w$ in $T_2$. The quartets counted in $r_1((u, v), w)$ must be resolved in $T_1$ but unresolved in $T_2$. We have the following result.

Lemma 8.1. Given $T_1$ and $T_2$, we have

$$2 \cdot |\mathcal{R}_1(T_1, T_2)| = \sum_{\substack{(u, v) \in \vec{\mathcal{E}}(T_1), \\ w \in V(T_2) \setminus \mathcal{L}(T_2)}} r_1((u, v), w).$$

Proof. Let $X = \{a, b, c, d\}$ be any quartet in $\mathcal{R}_1(T_1, T_2)$. Without loss of generality, assume that $T_1|X = ab|cd$, and that $X$ is associated with node $w \in V(T_2) \setminus \mathcal{L}(T_2)$. Since $X$ appears resolved in $T_1$, $\vec{\mathcal{E}}(T_1)$ must have exactly two directed edges, say $(u_1, v_1)$ and $(u_2, v_2)$, which strictly induce $ab|cd$. Thus, $X$ is counted in exactly two of the $r_1(\cdot, \cdot)$'s, namely, $r_1((u_1, v_1), w)$ and $r_1((u_2, v_2), w)$. The lemma follows. □

Thus, we can compute $|\mathcal{R}_1(T_1, T_2)|$ by computing all the $O(n^2)$ possible $r_1((u, v), w)$'s. However, doing so seems to require at least $\Theta(n^2 \cdot d)$ time, where $d$ is the degree of $T_1$. Instead, our algorithm computes a 2-approximate value of $|\mathcal{R}_1(T_1, T_2)|$ in $O(n^2)$ time by relying on the next lemma.

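Lemma 8.2. Given $T_1$ and $T_2$, let $T_1'$ denote the rooted tree obtained from $T_1$ by designating any internal node in $V(T_1)$ as the root. Then,

$$|\mathcal{R}_1(T_1, T_2)| \le \sum_{\substack{u \in V(T_1') \setminus (\mathrm{rt}(T_1') \cup \mathcal{L}(T_1')), \\ w \in V(T_2) \setminus \mathcal{L}(T_2)}} r_1((u, \mathrm{pa}(u)), w) \le 2 \cdot |\mathcal{R}_1(T_1, T_2)|.$$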

Proof. First, observe that if $u \in \mathcal{L}(T_1')$ and $w \in V(T_2) \setminus \mathcal{L}(T_2)$, then $r_1((u, \mathrm{pa}(u)), w) = 0$. Therefore, we must have

$$\sum_{\substack{u \in V(T_1') \setminus (\mathrm{rt}(T_1') \cup \mathcal{L}(T_1')), \\ w \in V(T_2) \setminus \mathcal{L}(T_2)}} r_1((u, \mathrm{pa}(u)), w) = \sum_{\substack{u \in V(T_1') \setminus \mathrm{rt}(T_1'), \\ w \in V(T_2) \setminus \mathcal{L}(T_2)}} r_1((u, \mathrm{pa}(u)), w).$$

Second, observe that $\mathcal{E}(T_1) = \mathcal{E}(T_1')$ and, therefore, by Lemma 8.1 we must have

$$\sum_{\substack{u \in V(T_1') \setminus \mathrm{rt}(T_1'), \\ w \in V(T_2) \setminus \mathcal{L}(T_2)}} r_1((u, \mathrm{pa}(u)), w) \le 2 \cdot |\mathcal{R}_1(T_1, T_2)|.$$

This proves the second inequality in the lemma.

To complete the proof, we now prove the first inequality. Let $X = \{a, b, c, d\}$ be any quartet in $\mathcal{R}_1(T_1, T_2)$, and, without loss of generality, assume that $T_1|X = ab|cd$, and that $X$ is associated with node $w \in V(T_2) \setminus \mathcal{L}(T_2)$. Since $X$ appears resolved in $T_1$, $\vec{\mathcal{E}}(T_1)$ must have exactly two directed edges, say $(u_1, v_1)$ and $(u_2, v_2)$, which strictly induce $ab|cd$. Consider the edge $\{u_1, v_1\} \in \mathcal{E}(T_1')$. There are two possible cases: either $v_1 = \mathrm{pa}(u_1)$, or $u_1 = \mathrm{pa}(v_1)$. If $v_1 = \mathrm{pa}(u_1)$, then the quartet $X$ is counted in the value $r_1((u_1, \mathrm{pa}(u_1)), w)$. Otherwise, if $u_1 = \mathrm{pa}(v_1)$, then $u_1, v_1, v_2, u_2$ must appear on the same root-to-leaf path in $T_1'$. Consequently, we must have $v_2 = \mathrm{pa}(u_2)$, and the quartet $X$ is counted in the value $r_1((u_2, \mathrm{pa}(u_2)), w)$. Thus, we must have $|\mathcal{R}_1(T_1, T_2)| \le \sum_{u \in V(T_1') \setminus \mathrm{rt}(T_1')} \sum_{w \in V(T_2) \setminus \mathcal{L}(T_2)} r_1((u, \mathrm{pa}(u)), w)$. The lemma follows. □

Thus, the idea for efficiently computing a 2-approximate value of $|\mathcal{R}_1(T_1, T_2)|$ is to first root $T_1$ arbitrarily at any internal node and then compute the value $r_1((u, \mathrm{pa}(u)), w)$ for each non-root node $u \in V(T_1)$ and each $w \in V(T_2) \setminus \mathcal{L}(T_2)$.

We now direct our attention to the problem of efficiently computing all the required values $r_1(\cdot, \cdot)$. Given a path $u_1, u_2, \ldots, u_k$ in $T_1$, where $k \ge 2$, let $\gamma(u_1, u_k, w)$ denote the number of quartets $X$ such that $T_1|X$ is induced in $T_1$ by every edge $(u_{i-1}, u_i)$, $2 \le i \le k$, and $X$ is associated with node $w$ in $T_2$.

The following lemma is analogous to Lemma 7.8 and shows how the value $r_1(\cdot, \cdot)$ can be computed by first computing certain $\gamma(\cdot, \cdot, \cdot)$ values.

Lemma 8.3. Let $(u, v) \in \vec{\mathcal{E}}(T_1)$ and $w \in V(T_2) \setminus \mathcal{L}(T_2)$. Then,

$$r_1((u, v), w) = \gamma(u, v, w) - \sum_{x \in \mathrm{adj}(u) \setminus \{v\}} \gamma(x, v, w).$$

Proof. Let $X = \{a, b, c, d\}$ be a quartet that is counted in $r_1((u, v), w)$. Without loss of generality, let $T_1|X = ab|cd$ such that $a, b \in \mathcal{L}(T_1(u, v))$. It can be verified that $X$ must satisfy the following three conditions: (i) $X$ must be associated with node $w$ in $T_2$, (ii) $a, b \in \mathcal{L}(T_1(u, v))$ and $c, d \in \mathcal{L}(T_1(v, u))$, and (iii) there must not exist any $x \in \mathrm{adj}(u) \setminus \{v\}$ such that $a, b \in \mathcal{L}(T_1(x, u))$. Moreover, observe that if a quartet $X = \{a, b, c, d\}$ satisfies these three conditions, then $X$ is counted in $r_1((u, v), w)$; these three conditions are thus necessary and sufficient.

Now observe that $\gamma(u, v, w)$ counts exactly those quartets that satisfy conditions (i) and (ii), while $\sum_{x \in \mathrm{adj}(u) \setminus \{v\}} \gamma(x, v, w)$ counts exactly those quartets that satisfy conditions (i) and (ii), but not condition (iii). The lemma follows. □

To state our next results we need the following notation. Given phylogenetic trees $T_1$ and $T_2$, consider a path $u_1, u_2, \ldots, u_k$, where $k \ge 2$, in tree $T_1$, and an internal node $w \in V(T_2)$ of degree at least 4. Let $P = \mathcal{L}(T_1(u_1, u_2))$, $Q = \mathcal{L}(T_1(u_k, u_{k-1}))$, and let $x_1, \ldots, x_{|\mathrm{adj}(w)|}$ denote the neighbors of $w$. Consider the quartets that are induced by every edge $(u_{i-1}, u_i)$, $2 \le i \le k$, in $T_1$. Let us call these quartets relevant. Observe that a quartet is relevant if and only if it contains exactly two leaves from $P$ and two leaves from $Q$. Let

1. $n_1(u_1, u_k, w)$ denote the number of relevant quartets $X$ for which there exists a neighbor $x$ of $w$ in tree $T_2$, such that $X$ is completely contained in $T_2(x, w)$,

2. $n_2(u_1, u_k, w)$ denote the number of relevant quartets $X$ for which there exist two neighbors $x, y$ of $w$ in tree $T_2$, such that $T_2(x, w)$ contains three leaves from $X$ and $T_2(y, w)$ contains the other leaf,

3. $n_3(u_1, u_k, w)$ denote the number of relevant quartets $X$ for which there exist two neighbors $x, y$ of $w$ in tree $T_2$, such that $T_2(x, w)$ contains two leaves from $X$ and $T_2(y, w)$ contains the other two leaves, and

4. $n_4(u_1, u_k, w)$ denote the number of relevant quartets $X$ for which there exist three neighbors $x, y, z$ of $w$ in tree $T_2$, such that $T_2(x, w)$ contains two leaves from $X$, $T_2(y, w)$ contains one leaf from $X$, and $T_2(z, w)$ contains the remaining leaf.

Then, we must have the following.

Lemma 8.4.

$$\gamma(u_1, u_k, w) = \binom{|P|}{2} \cdot \binom{|Q|}{2} - n_1(u_1, u_k, w) - n_2(u_1, u_k, w) - n_3(u_1, u_k, w) - n_4(u_1, u_k, w). \tag{34}$$

Proof. The term $\binom{|P|}{2} \cdot \binom{|Q|}{2}$ is the number of relevant quartets. Furthermore, each relevant quartet must occur in tree $T_2$ in exactly one of the five configurations captured by the terms $n_1(u_1, u_k, w)$, $n_2(u_1, u_k, w)$, $n_3(u_1, u_k, w)$, $n_4(u_1, u_k, w)$, and $\gamma(u_1, u_k, w)$. The lemma follows. □

The following four lemmas show that the values of $n_1(u_1, u_k, w)$, $n_2(u_1, u_k, w)$, $n_3(u_1, u_k, w)$, and $n_4(u_1, u_k, w)$ can be computed in $O(|\mathrm{adj}(w)|)$ time. The proofs of these lemmas all follow the same approach: in each case, we show that the required value can be expressed as a sum of $O(|\mathrm{adj}(w)|)$ quantities, every one of which can be computed in $O(1)$ time based on the values computed in the preprocessing step.

Lemma 8.5. The value $n_1(u_1, u_k, w)$ can be computed in $O(|\mathrm{adj}(w)|)$ time.

Proof. We will show that

$$n_1(u_1, u_k, w) = \sum_{i=1}^{|\mathrm{adj}(w)|} \binom{|\mathcal{L}(T_2(x_i, w)) \cap P|}{2} \cdot \binom{|\mathcal{L}(T_2(x_i, w)) \cap Q|}{2}. \tag{35}$$

The right-hand side of Equation (35) counts all those quartets that are completely contained in $\mathcal{L}(T_2(x, w))$ for some $x \in \mathrm{adj}(w)$ and that have two elements from $P$ and two from $Q$. These are exactly the quartets that must be counted in $n_1(u_1, u_k, w)$. □

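Lemma 8.6. The value $n_2(u_1, u_k, w)$ can be computed in $O(|\mathrm{adj}(w)|)$ time.

Proof. We will show that

$$n_2(u_1, u_k, w) = \sum_{i=1}^{|\mathrm{adj}(w)|} \binom{|\mathcal{L}(T_2(x_i, w)) \cap P|}{2} \cdot |\mathcal{L}(T_2(x_i, w)) \cap Q| \cdot |\mathcal{L}(T_2(w, x_i)) \cap Q| + \sum_{i=1}^{|\mathrm{adj}(w)|} \binom{|\mathcal{L}(T_2(x_i, w)) \cap Q|}{2} \cdot |\mathcal{L}(T_2(x_i, w)) \cap P| \cdot |\mathcal{L}(T_2(w, x_i)) \cap P|. \tag{36}$$

The quartets $X$ counted in $n_2(u_1, u_k, w)$ are exactly those for which there exist two neighbors $x, y$ of $w$ such that either (i) $X \cap \mathcal{L}(T_2(x, w))$ contains two leaves from $P$ and one from $Q$, and $X \cap \mathcal{L}(T_2(y, w))$ contains a leaf from $Q$, or (ii) $X \cap \mathcal{L}(T_2(x, w))$ contains two leaves from $Q$ and one from $P$, and $X \cap \mathcal{L}(T_2(y, w))$ contains a leaf from $P$. The first term on the right-hand side of Equation (36) is exactly the number of quartets that satisfy condition (i), and the second term on the right-hand side is exactly the number of quartets satisfying condition (ii). □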

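Lemma 8.7. The value $n_3(u_1, u_k, w)$ can be computed in $O(|\mathrm{adj}(w)|)$ time.

Proof. We will show that

$$n_3(u_1, u_k, w) = \sum_{i=1}^{|\mathrm{adj}(w)|} \left(\alpha - \binom{|\mathcal{L}(T_2(x_i, w)) \cap P|}{2}\right) \cdot \binom{|\mathcal{L}(T_2(x_i, w)) \cap Q|}{2} + \frac{1}{2} \sum_{i=1}^{|\mathrm{adj}(w)|} \bigl\{\beta - |\mathcal{L}(T_2(x_i, w)) \cap P| \cdot |\mathcal{L}(T_2(x_i, w)) \cap Q|\bigr\} \cdot |\mathcal{L}(T_2(x_i, w)) \cap P| \cdot |\mathcal{L}(T_2(x_i, w)) \cap Q|, \tag{37}$$

where

$$\alpha = \sum_{i=1}^{|\mathrm{adj}(w)|} \binom{|\mathcal{L}(T_2(x_i, w)) \cap P|}{2}, \tag{38}$$

$$\beta = \sum_{i=1}^{|\mathrm{adj}(w)|} |\mathcal{L}(T_2(x_i, w)) \cap P| \cdot |\mathcal{L}(T_2(x_i, w)) \cap Q|. \tag{39}$$

The quartets $X$ counted in $n_3(u_1, u_k, w)$ are exactly those quartets for which there exist two neighbors $x, y$ of $w$ such that either (i) $X \cap \mathcal{L}(T_2(x, w))$ contains two leaves from $P$, and $X \cap \mathcal{L}(T_2(y, w))$ contains two leaves from $Q$, or (ii) $X \cap \mathcal{L}(T_2(x, w))$ and $X \cap \mathcal{L}(T_2(y, w))$ both contain one leaf each from $P$ and $Q$. The first term on the right-hand side of Equation (37) is exactly the number of quartets that satisfy condition (i). The sum in the second term on the right-hand side counts the quartets satisfying condition (ii) exactly twice each (due to the symmetry between $x$ and $y$ in condition (ii)). This explains the $\frac{1}{2}$ multiplicative factor. □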

Lemma 8.8. The value $n_4(u_1, u_k, w)$ can be computed in $O(|\mathrm{adj}(w)|)$ time.

Proof. We will show that

$$n_4(u_1, u_k, w) = \sum_{i=1}^{|\mathrm{adj}(w)|} \binom{|\mathcal{L}(T_2(x_i, w)) \cap P|}{2} \cdot \binom{|\mathcal{L}(T_2(w, x_i)) \cap Q|}{2} + \sum_{i=1}^{|\mathrm{adj}(w)|} \binom{|\mathcal{L}(T_2(x_i, w)) \cap Q|}{2} \cdot \binom{|\mathcal{L}(T_2(w, x_i)) \cap P|}{2} + \sum_{i=1}^{|\mathrm{adj}(w)|} |\mathcal{L}(T_2(x_i, w)) \cap P| \cdot |\mathcal{L}(T_2(x_i, w)) \cap Q| \cdot |\mathcal{L}(T_2(w, x_i)) \cap P| \cdot |\mathcal{L}(T_2(w, x_i)) \cap Q| - 2 \cdot n_3(u_1, u_k, w). \tag{40}$$

The quartets $X$ counted in $n_4(u_1, u_k, w)$ are exactly those quartets for which there exist three neighbors $x, y, z$ of $w$ such that either (i) $X \cap \mathcal{L}(T_2(x, w))$ contains two leaves from $P$, and $X \cap \mathcal{L}(T_2(y, w))$ and $X \cap \mathcal{L}(T_2(z, w))$ each contain a leaf from $Q$, or (ii) $X \cap \mathcal{L}(T_2(x, w))$ contains two leaves from $Q$, and $X \cap \mathcal{L}(T_2(y, w))$ and $X \cap \mathcal{L}(T_2(z, w))$ each contain a leaf from $P$, or (iii) $X \cap \mathcal{L}(T_2(x, w))$ contains a leaf from $P$ and a leaf from $Q$, $X \cap \mathcal{L}(T_2(y, w))$ contains a leaf from $P$, and $X \cap \mathcal{L}(T_2(z, w))$ contains a leaf from $Q$.

The first term on the right-hand side of Equation (40) counts all the quartets that satisfy condition (i), and, in addition, all the quartets that satisfy condition (i) from the proof of Lemma 8.7. Similarly, the second term on the right-hand side counts the quartets that satisfy condition (ii), along with all the quartets that satisfy condition (i) from the proof of Lemma 8.7. The third term on the right-hand side counts those quartets that satisfy condition (iii), and also counts, exactly twice each (again due to symmetry), those that satisfy condition (ii) from the proof of Lemma 8.7. Thus, by adding the first three terms on the right-hand side of Equation (40), we obtain the value $n_4(u_1, u_k, w) + 2 \cdot n_3(u_1, u_k, w)$. □

procedure Approx-$\mathcal{R}_1(T_1, T_2)$
  Convert the unrooted tree $T_1$ into a rooted one by rooting it at any internal node.
  for each internal node $u \in V(T_1) \setminus \mathrm{rt}(T_1)$ do
    for each internal unresolved node $w \in V(T_2)$ do
      Compute $r_1((u, \mathrm{pa}(u)), w)$.
  return the sum of all computed $r_1(\cdot, \cdot)$.

Figure 3: Computing a 2-approximation to $|\mathcal{R}_1(T_1, T_2)|$



Lemma 8.9. Given two unrooted phylogenetic trees $T_1$ and $T_2$ on the same leaf set of size $n$, a value $y$ such that $|\mathcal{R}_1(T_1, T_2)| \le y \le 2 \cdot |\mathcal{R}_1(T_1, T_2)|$ can be computed in $O(n^2)$ time.

Proof. Our algorithm to compute a 2-approximate value of $|\mathcal{R}_1(T_1, T_2)|$ is summarized in Figure 3. Lemma 8.2 immediately implies that the algorithm computes a value between $|\mathcal{R}_1(T_1, T_2)|$ and $2 \cdot |\mathcal{R}_1(T_1, T_2)|$.

We now analyze the time complexity of our algorithm. By Lemmas 8.5, 8.6, 8.7, and 8.8, the values $n_1(u_1, u_k, w)$, $n_2(u_1, u_k, w)$, $n_3(u_1, u_k, w)$, and $n_4(u_1, u_k, w)$ can all be computed within $O(|\mathrm{adj}(w)|)$ time. Hence, by Lemma 8.4, the value of any $\gamma(\cdot, \cdot, w)$ can be computed in $O(|\mathrm{adj}(w)|)$ time. Lemma 8.3 now implies that, for any given $(u, v) \in \vec{\mathcal{E}}(T_1)$ and $w \in V(T_2) \setminus \mathcal{L}(T_2)$, the value $r_1((u, v), w)$ can be computed within $O(|\mathrm{adj}(u)| \cdot |\mathrm{adj}(w)|)$ time. Thus, the total time complexity of the algorithm is $O\bigl(\sum_{u \in V(T_1)} \sum_{w \in V(T_2)} |\mathrm{adj}(u)| \cdot |\mathrm{adj}(w)|\bigr)$, which is $O(n^2)$. □
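
The counting identities behind Lemmas 8.4 through 8.8 can be sanity-checked on small inputs. The sketch below works only with counts: `p[i]` and `q[i]` stand for $|\mathcal{L}(T_2(x_i, w)) \cap P|$ and $|\mathcal{L}(T_2(x_i, w)) \cap Q|$, so that the leaves on the other side of the edge number `sum(p) - p[i]` and `sum(q) - q[i]`. This encoding and the function names are assumptions made for the illustration, and the formulas follow our reading of Equations (34) to (40). The brute-force routine enumerates relevant quartets whose four leaves fall in four distinct subtrees around $w$.

```python
from itertools import combinations
from math import comb


def gamma_quartet(p, q):
    """gamma via Equation (34), using per-neighbor leaf counts p[i], q[i]."""
    P, Q = sum(p), sum(q)
    pairs = list(zip(p, q))
    alpha = sum(comb(pi, 2) for pi in p)
    beta = sum(pi * qi for pi, qi in pairs)
    n1 = sum(comb(pi, 2) * comb(qi, 2) for pi, qi in pairs)
    n2 = sum(comb(pi, 2) * qi * (Q - qi) + comb(qi, 2) * pi * (P - pi)
             for pi, qi in pairs)
    n3 = (sum((alpha - comb(pi, 2)) * comb(qi, 2) for pi, qi in pairs)
          + sum((beta - pi * qi) * pi * qi for pi, qi in pairs) // 2)
    n4 = (sum(comb(pi, 2) * comb(Q - qi, 2) for pi, qi in pairs)
          + sum(comb(qi, 2) * comb(P - pi, 2) for pi, qi in pairs)
          + sum(pi * qi * (P - pi) * (Q - qi) for pi, qi in pairs)
          - 2 * n3)
    return comb(P, 2) * comb(Q, 2) - n1 - n2 - n3 - n4


def gamma_quartet_brute(p, q):
    """Enumerate quartets with two P-leaves and two Q-leaves that lie in
    four distinct subtrees around w."""
    p_sub = [i for i, pi in enumerate(p) for _ in range(pi)]  # subtree of each P-leaf
    q_sub = [i for i, qi in enumerate(q) for _ in range(qi)]  # subtree of each Q-leaf
    count = 0
    for i, j in combinations(range(len(p_sub)), 2):
        for k, l in combinations(range(len(q_sub)), 2):
            if len({p_sub[i], p_sub[j], q_sub[k], q_sub[l]}) == 4:
                count += 1
    return count


p, q = [2, 1, 0, 1], [1, 0, 2, 1]
print(gamma_quartet(p, q), gamma_quartet_brute(p, q))
```

Every quantity in `gamma_quartet` is a single pass over the neighbors of $w$, which mirrors the $O(|\mathrm{adj}(w)|)$ bound used in the proof of Lemma 8.9.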

9 Discussion 

We have defined and analyzed distance measures for rooted and unrooted phylogenies that account 
for partially-resolved nodes. A number of problems remain. First, there is the question of determining whether there exists a polynomial-time algorithm for computing the median tree with respect to parametric triplet and quartet distances. We conjecture that this problem is NP-hard. Also open is the question of whether computing the Hausdorff distance between partially-resolved trees is NP-hard. Finally, many (if not most) applications require the comparison of trees that do not have the same set of taxa. It would be interesting to investigate whether any of our distance measures can be extended to this setting.